A low-cost high-speed twin-prefetching DSP-based shared-memory system for real-time image processing applications by Christou, Charalambos Stephanou
New Jersey Institute of Technology
Spring 1998
Recommended Citation
Christou, Charalambos Stephanou, "A low-cost high-speed twin-prefetching DSP-based shared-memory system for real-time image
processing applications" (1998). Dissertations. 945.
https://digitalcommons.njit.edu/dissertations/945
Please Note: The author retains the copyright while the
New Jersey Institute of Technology reserves the right to
distribute this thesis or dissertation.
The Van Houten library has removed some of
the personal information and all signatures from
the approval page and biographical sketches of
theses and dissertations in order to protect the
identity of NJIT graduates and faculty.
ABSTRACT
A LOW-COST HIGH-SPEED TWIN-PREFETCHING DSP-BASED
SHARED-MEMORY SYSTEM FOR REAL-TIME
IMAGE PROCESSING APPLICATIONS
This dissertation introduces, investigates, and evaluates a low-cost high-speed
twin-prefetching DSP-based bus-interconnected shared-memory system for real-time
image processing applications. The proposed architecture can effectively support 32
DSPs in contrast to a maximum of 4 DSPs supported by existing DSP-based bus-
interconnected systems. This significant enhancement is achieved by introducing two
small programmable fast memories (Twins) between the processor and the shared bus
interconnect. While one memory is transferring data from/to the shared memory, the
other is supplying the core processor with data. The elimination of the traditional direct
linkage of the shared bus and processor data bus makes it feasible to use a wider
shared bus, i.e., the shared bus width becomes independent of the data bus width of the
processors. The fast prefetching memories and the wider shared bus provide additional
bus bandwidth into the system, which eliminates large memory latencies; such memory
latencies constitute the major drawback for the performance of shared-memory
multiprocessors. Furthermore, in contrast to existing DSP-based uniprocessor or
multiprocessor systems, the proposed architecture does not require all data to be placed in
expensive fast on-chip or off-chip memory in order to reach or maintain peak
performance. It can also maintain peak performance regardless of whether the
processed image is small or large.
The performance of the proposed architecture has been extensively investigated
by executing computationally intensive applications such as real-time high-resolution image
processing. The effect of a wide variety of hardware design parameters on performance
has been examined. More specifically, tables and graphs comprehensively analyze the
performance of 1, 2, 4, 8, 16, 32 and 64 DSP-based systems for a wide variety of shared
data interconnect widths, namely 32, 64, 128, 256 and 512. In addition, the effect of the
wide variance of temporal and spatial locality (present in different applications) on the
multiprocessor's execution time is investigated and analyzed. Finally, the prefetching
cache size was varied from a few kilobytes to 4 Mbytes and the corresponding effect on
the execution time was investigated. Our performance analysis has clearly shown that
the execution time converges to a shallow minimum, i.e., it is not sensitive to the size of
the prefetching cache. The significance of this observation is that near-optimum
performance can be achieved with a small (16 to 300 Kbytes) amount of prefetching
cache.
A LOW-COST HIGH-SPEED TWIN-PREFETCHING DSP-BASED
SHARED-MEMORY SYSTEM FOR REAL-TIME
IMAGE PROCESSING APPLICATIONS
Submitted to the Faculty of
New Jersey Institute of Technology
in Partial Fulfillment of the Requirements for the Degree of
Doctor of Philosophy
Department of Electrical and Computer Engineering
May 1998
Copyright © 1998 by Charalambos Stephanou Christou
ALL RIGHTS RESERVED
APPROVAL PAGE
A LOW-COST HIGH-SPEED TWIN-PREFETCHING DSP-BASED
SHARED-MEMORY SYSTEM FOR REAL-TIME
IMAGE PROCESSING APPLICATIONS
Charalambos Stephanou Christou
Dr. Constantine N. Manikopoulos, Dissertation Advisor	Date
Associate Professor of Electrical and Computer Engineering, NJIT
Dr. Dennis Karvelas, Committee Member	Date
Director of M.S. in Telecommunications, Computer Science Dept., NJIT
Dr. Edwin Hou, Committee Member	Date
Associate Professor of Electrical and Computer Engineering, NJIT
Dr. John D. Carpinelli, Committee Member	Date
Associate Professor of Electrical and Computer Engineering and Computer
and Information Science, NJIT
Dr. Slawomir Piatek, Committee Member	Date
Special Lecturer, Physics Department, NJIT
BIOGRAPHICAL SKETCH
Author: Charalambos Stephanou Christou
Degree: Doctor of Philosophy in Electrical Engineering
Date: May 1998
Undergraduate and Graduate Education:
• Doctor of Philosophy in Electrical Engineering,
New Jersey Institute of Technology, Newark, NJ, 1998
• Master of Science in Electrical Engineering,
New Jersey Institute of Technology, Newark, NJ, 1988
• Bachelor of Science in Electrical Engineering,
New Jersey Institute of Technology, Newark, NJ, 1987
Major:	 Electrical Engineering
Publications:
1. Charalambos S. Christou and Constantine N. Manikopoulos, "Wider Shared-Bus - A
Solution for Increasing the Number of Effectively Supported Processors in a
Multiprocessor System," to be submitted for publication.
2. Charalambos S. Christou and Constantine N. Manikopoulos, "A Low-Cost DSP-
Based Shared-Memory Multiprocessor System for Real-Time Applications," to be
submitted for publication.
3. Charalambos S. Christou and John D. Carpinelli, "Determination of Optimal Cache
Size through Modeling and Simulation," Proceed. of Twentieth Pittsburgh
Conference on Modeling and Simulation, pp. 1267-1271, May 1989.
This dissertation is dedicated to
my parents Stephanos and Evagelia,
my wife Filanthi, my sister Anthoula,
and my brothers Andreas and Christos
ACKNOWLEDGMENT
I would like to express my gratitude to my advisor, Dr. Constantine N.
Manikopoulos, for his guidance throughout the course of the dissertation research.
Special thanks to Dr. Dennis Karvelas, Dr. Edwin Hou, Dr. John D. Carpinelli,
and Dr. Slawomir Piatek for serving as members of the dissertation committee.
Special thanks to my friends Socrates Ioannides, Tasos Kondos, Ioannis Tzathas,
Lazaros Vastardos, Vaggelis Tsimis, Christos Banias, Michalis and Nancy Nikolaou,
Antonis Tjirkallis, Jim Koroniades, Evzonas Andreas, Akis Kleanthous, Michalis Sideras,
Christos Sideras, Giannis Milonas, Nikos Antoniou, Kyriakos Mouskos, Giorgos
Antoniou, Giorgos Donos, Bambos and Maria Pafiti, Lefteris and Andri Panagiotou,
Kostas and Maria Tsatsos, Panagiotis and Maria Vastardos, Kyriakos and Themi Vyras,
and Petros and Andri Kyriakou for their love and enduring support.
Finally, I would very much like to thank my wife Filanthi, my mother Evagelia,
my father Stefanos, my sister Anthoula, and my brothers Andreas and Christos for their
spiritual support throughout my academic endeavors. I can hardly find words to express




1 INTRODUCTION
1.1 Parallel Processing Systems
1.1.1 Shared-Memory Multiprocessors
1.1.1.1 Caches
1.1.1.2 Dynamic Interconnection Networks
1.1.1.3 Prefetching
1.1.2 Message-Passing Multicomputers
1.1.3 Image Processing
1.2 Motivation, Objectives, and Contributions
1.3 Outline
2 TWIN-PREFETCHING DSP-BASED SHARED-MEMORY SYSTEM
2.1 The ADSP-21060
2.1.1 Multiprocessing
2.1.2 Dual-Ported Internal Memory
2.2 Data Memory
2.2.1 The TTCs
2.3 Host Processor
2.4 Theoretical Analysis
2.4.1 Partitioning Images into Cache Prefetching Segments





2.4.2 Categories of Applications
2.5 Simulation
3 PERFORMANCE ANALYSIS
3.1 Methodology for Performance Evaluation
3.2 Two-Dimensional Convolution
3.3 Application Parameters
4 EFFECT OF SHARED-BUS-WIDTH ON PERFORMANCE
4.1 Effect of Shared-Bus-Width on Performance when Template Matrix = 2x2
4.2 Effect of Shared-Bus-Width on Performance when Template Matrix = 3x3
4.3 Effect of Shared-Bus-Width on Performance when Template Matrix = 6x6
4.4 Effect of Shared-Bus-Width on Performance when Template Matrix = 9x9
4.5 Discussion of Results
5 EFFECT OF TEMPORAL AND SPATIAL LOCALITY ON PERFORMANCE
5.1 Effect of Spatial and Temporal Locality on a P-Node Twin-Prefetching
Multiprocessor when Shared-Bus-Width is 32
5.1.1 Performance of One DSP Processor Without Twin-Prefetching Cache
Memories
5.2 Effect of Spatial and Temporal Locality on a P-Node Twin-Prefetching
Multiprocessor when Shared-Bus-Width is 64
5.3 Effect of Spatial and Temporal Locality on a P-Node Twin-Prefetching
Multiprocessor when Shared-Bus-Width is 128
5.4 Effect of Spatial and Temporal Locality on a P-Node Twin-Prefetching
Multiprocessor when Shared-Bus-Width is 256






5.5 Effect of Spatial and Temporal Locality on a P-Node Twin-Prefetching
Multiprocessor when Shared-Bus-Width is 512
5.6 Communication Overhead
6 DETERMINATION OF THE OPTIMAL SIZE OF PREFETCHING CACHES
6.1 Shared Bus Contention Cases
6.2 Selection of Results
6.3 Analysis of Results
6.4 Optimal Selection of Prefetching Cache Size when P=1
6.5 Optimal Selection of Prefetching Cache Size when P=2
6.6 Optimal Selection of Prefetching Cache Size when P=4
6.7 Optimal Selection of Prefetching Cache Size when P=8
6.8 Optimal Selection of Prefetching Cache Size when P=16
6.9 Optimal Selection of Prefetching Cache Size when P=32
6.10 Optimal Selection of Prefetching Cache Size when P=64
6.11 Discussion of Results
7 CONCLUSIONS
APPENDIX A ASSEMBLER CODE FOR 2D CONVOLUTION
APPENDIX B EXECUTION TIME AND SPEEDUP VS. SHARED-BUS-WIDTH








4-1 Time vs. P for nl=32, 64, 128, 256 and 512 when template matrix=2x2
4-2 Speedup vs. P for nl=32, 64, 128, 256 and 512 when template matrix=2x2
4-3 Efficiency vs. P for nl=32, 64, 128, 256 and 512 when template matrix=2x2
4-4 Speedup vs. Shared-bus-width (nl) for P=1, 2, 4, 8, 16, 32 and 64 when
template matrix=2x2
4-5 Time vs. P for nl=32, 64, 128, 256 and 512 when template matrix=3x3
4-6 Speedup vs. P for nl=32, 64, 128, 256 and 512 when template matrix=3x3
4-7 Efficiency vs. P for nl=32, 64, 128, 256 and 512 when template matrix=3x3
4-8 Speedup vs. Shared-bus-width (nl) for P=1, 2, 4, 8, 16, 32 and 64 when
template matrix=3x3
4-9 Time vs. P for nl=32, 64, 128, 256 and 512 when template matrix=6x6
4-10 Speedup vs. P for nl=32, 64, 128, 256 and 512 when template matrix=6x6
4-11 Efficiency vs. P for nl=32, 64, 128, 256 and 512 when template matrix=6x6
4-12 Speedup vs. Shared-bus-width (nl) for P=1, 2, 4, 8, 16, 32 and 64 when
template matrix=6x6
4-13 Time vs. P for nl=32, 64, 128, 256 and 512 when template matrix=9x9
4-14 Speedup vs. P for nl=32, 64, 128, 256 and 512 when template matrix=9x9
4-15 Efficiency vs. P for nl=32, 64, 128, 256 and 512 when template matrix=9x9
4-16 Speedup vs. Shared-bus-width (nl) for P=1, 2, 4, 8, 16, 32 and 64 when
template matrix=9x9





4-18 P-node twin prefetching multiprocessors with system efficiency E>0.90
5-1 Execution time vs. Number of Processors for template matrix=2x2, 3x3, 6x6,
9x9 when nl=32




5-3 Efficiency vs. Number of Processors for template matrix=2x2, 3x3, 6x6, 9x9
when nl=32
5-4 Execution time vs. Number of Processors for template matrix=2x2, 3x3, 6x6,
9x9 when nl=64
5-5 Speedup vs. Number of Processors for template matrix=2x2, 3x3, 6x6, 9x9
when nl=64
5-6 Efficiency vs. Number of Processors for template matrix=2x2, 3x3, 6x6, 9x9
when nl=64
5-7 Execution time vs. Number of Processors for template matrix=2x2, 3x3, 6x6,
9x9 when nl=128
5-8 Speedup vs. Number of Processors for template matrix=2x2, 3x3, 6x6, 9x9
when nl=128
5-9 Efficiency vs. Number of Processors for template matrix=2x2, 3x3, 6x6, 9x9
when nl=128
5-10 Execution time vs. Number of Processors for template matrix=2x2, 3x3, 6x6,
9x9 when nl=256
5-11 Speedup vs. Number of Processors for template matrix=2x2, 3x3, 6x6, 9x9
when nl=256
5-12 Efficiency vs. Number of Processors for template matrix=2x2, 3x3, 6x6, 9x9
when nl=256






5-13 Execution time vs. Number of Processors for template matrix=2x2, 3x3, 6x6,
9x9 when nl=512
5-14 Speedup vs. Number of Processors for template matrix=2x2, 3x3, 6x6, 9x9
when nl=512
5-15 Efficiency vs. Number of Processors for template matrix=2x2, 3x3, 6x6, 9x9
when nl=512
5-16 Execution Time vs. Number of Processors when α=1 (nl=512)
5-17 Execution Time vs. Number of Processors when α=2 (nl=512)
6-1 Prefetching cache size vs. time when P=8, nl=32, and template window=2x2
6-2 Prefetching cache size vs. time when P=8, nl=128, and template window=2x2
6-3 Prefetching cache size vs. time when P=8, nl=512, and template window=2x2
6-4 Optimal prefetching cache size when template window=2x2
6-5 Optimal prefetching cache size when template window=3x3
6-6 Optimal prefetching cache size when template window=6x6
6-7 Optimal prefetching cache size when template window=9x9
6-8 Optimal prefetching cache size when P=1
6-9 Optimal prefetching cache size when P=2
6-10 Optimal prefetching cache size when P=4
6-11 Optimal prefetching cache size when P=8
6-12 Optimal prefetching cache size when P=16





6-14 Optimal prefetching cache size when P=64
6-15 Optimal prefetching cache size
B-1 Execution time vs. Shared-Bus-Width (nl) for P=1, 2, 4, 8, 16, 32 and 64
when template matrix=2x2
B-2 Speedup vs. Shared-Bus-Width (nl) for P=1, 2, 4, 8, 16, 32 and 64 when
template matrix=2x2
B-3 Execution time vs. Shared-Bus-Width (nl) for P=1, 2, 4, 8, 16, 32 and 64
when template matrix=3x3




B-5 Execution time vs. Shared-Bus-Width (nl) for P=1, 2, 4, 8, 16, 32 and 64
when template matrix=6x6
B-6 Speedup vs. Shared-Bus-Width (nl) for P=1, 2, 4, 8, 16, 32 and 64 when
template matrix=6x6
B-7 Execution time vs. Shared-Bus-Width (nl) for P=1, 2, 4, 8, 16, 32 and 64
when template matrix=9x9





1-1 The UMA multiprocessor model
1-2 The NUMA multiprocessor model
1-3 A multicomputer system
1-4 Common topologies for interconnection networks
2-1 Twin-prefetching multiprocessor system diagram
2-2 Node block diagram
4-1 Time vs. P for nl=32, 64, 128, 256 and 512 when template matrix=2x2
4-2 Speedup vs. P for nl=32, 64, 128, 256 and 512 when template matrix=2x2
4-3 Efficiency vs. P for nl=32, 64, 128, 256 and 512 when template matrix=2x2
4-4 Speedup vs. Shared-bus-width (nl) for P=1, 2, 4, 8, 16, 32 and 64 when
template matrix=2x2
4-5 Time vs. P for nl=32, 64, 128, 256 and 512 when template matrix=3x3
4-6 Speedup vs. P for nl=32, 64, 128, 256 and 512 when template matrix=3x3
4-7 Efficiency vs. P for nl=32, 64, 128, 256 and 512 when template matrix=3x3
4-8 Speedup vs. Shared-bus-width (nl) for P=8, 16, 32, and 64 when template
matrix=3x3
4-9 Time vs. P for nl=32, 64, 128, 256 and 512 when template matrix=6x6
4-10 Speedup vs. P for nl=32, 64, 128, 256 and 512 when template matrix=6x6
4-11 Efficiency vs. P for nl=32, 64, 128, 256 and 512 when template matrix=6x6
4-12 Speedup vs. Shared-bus-width (nl) for P=1, 2, 4, 8, 16, 32, and 64 when
template matrix=6x6





4-13 Time vs. P for nl=32, 64, 128, 256 and 512 when template matrix=9x9
4-14 Speedup vs. P for nl=32, 64, 128, 256 and 512 when template matrix=9x9
4-15 Efficiency vs. P for nl=32, 64, 128, 256 and 512 when template matrix=9x9
4-16 Speedup vs. Shared-Bus-Width (nl) for P=1, 2, 4, 8, 16, 32, and 64 when
template matrix=9x9
5-1 Execution time of applications on a uniprocessor system with and without
twin prefetching (nl=32)
5-2 Execution time vs. Number of Processors for the template matrix=2x2, 3x3,
6x6, 9x9 when nl=32
5-3 Speedup vs. Number of Processors for template matrix=2x2, 3x3, 6x6, 9x9
when nl=32
5-4 Efficiency vs. Number of Processors for template matrix=2x2, 3x3, 6x6, 9x9
when nl=32
5-5 Execution time vs. Number of Processors for the template matrix=2x2, 3x3,
6x6, 9x9 when nl=64
5-6 Speedup vs. Number of Processors for template matrix=2x2, 3x3, 6x6, 9x9
when nl=64
5-7 Efficiency vs. Number of Processors for template matrix=2x2, 3x3, 6x6, 9x9
when nl=64
5-8 Execution time vs. Number of Processors for the template matrix=2x2, 3x3,
6x6, 9x9 when nl=128
5-9 Speedup vs. Number of Processors for template matrix=2x2, 3x3, 6x6, 9x9
when nl=128
5-10 Efficiency vs. Number of Processors for template matrix=2x2, 3x3, 6x6,
9x9 when nl=128
5-11 Execution time vs. Number of Processors for the template matrix=2x2, 3x3,
6x6, 9x9 when nl=256









5-13 Efficiency vs. Number of Processors for template matrix=2x2, 3x3, 6x6, 9x9
when nl=256
5-14 Execution time vs. Number of Processors for the template matrix=2x2, 3x3,
6x6, 9x9 when nl=512
5-15 Speedup vs. Number of Processors for template matrix=2x2, 3x3, 6x6, 9x9
when nl=512




B-1 Execution time vs. Shared-Bus-Width (nl) for P=1, 2, 4, 8, 16, 32 and 64
when template matrix=2x2
B-2 Execution time vs. Shared-Bus-Width (nl) for P=8, 16, 32 and 64 when
template matrix=2x2












B-6 Speedup vs. Shared-Bus-Width (nl) for P=8, 16, 32 and 64 when template
matrix=3x3
B-7 Execution time vs. Shared-Bus-Width (nl) for P=1, 2, 4, 8, 16, 32 and 64
when template matrix=6x6
B-8 Execution time vs. Shared-Bus-Width (nl) for P=8, 16, 32 and 64 when
template matrix=6x6







B-10 Execution time vs. Shared-Bus-Width (nl) for P=1, 2, 4, 8, 16, 32 and 64
when template matrix=9x9
B-11 Execution time vs. Shared-Bus-Width (nl) for P=8, 16, 32 and 64 when
template matrix=9x9
B-12 Speedup vs. Shared-Bus-Width (nl) for P=1, 2, 4, 8, 16, 32 and 64 when
template matrix=9x9




Parallel processing has had a tremendous impact on many areas of computer application.
The high raw computing power of parallel computers can now meet application requirements
that were until recently beyond the capability of conventional computing techniques. One
of the key problems to be solved in the area of parallel processing is bridging the
gap between processor speed and memory latency. Processor performance has increased
dramatically over the past few years, while memory latency and bandwidth have
progressed at a much slower pace. Large latencies have considerably reduced the number
of processors that can be effectively supported in shared-memory parallel computers.
The focus of this dissertation is a new cost-effective parallel computer architecture that
reduces memory latency and effectively supports a greater number of processing
elements. This chapter provides introductory background on this fast-growing research
area. The motivation, objectives, and contributions of the dissertation, as well as its
outline, are presented at the end of the chapter.
1.1 Parallel Processing Systems
A parallel computer, in general, has attributes such as the number of processors (or
number of nodes), the memory system, peripherals, and the interconnection network (a
collection of wires and connectors for data transactions among processors, memory
modules, and peripheral devices). Parallel architectures can be classified into two large
classes: shared-memory multiprocessors and message-passing multicomputers [32][31].
Multiprocessors have a single, global, shared address space visible to all processors. Any
processor can read or write any word in the address space by moving data from or to a
memory address. Communication is via the shared memory. Multicomputers do not have
a shared memory and must communicate by message passing.
1.1.1 Shared-Memory Multiprocessors
Multiprocessors are also called tightly coupled systems due to the high degree of resource
sharing. Three shared-memory multiprocessor models are primarily used: the uniform-
memory-access (UMA) model, the nonuniform-memory-access (NUMA) model, and the
cache-only-memory-access (COMA) model [32]. They differ in the way the memory and
other resources are distributed. If the time taken by a processor to access any memory
word in the system is identical, the computer is classified as UMA (model shown in
Figure 1.1). All processors have equal access time to all memory locations in all shared-
memory modules (marked as SM) under the condition of no network congestion. That is
why it is called a uniform-memory-access model. In the NUMA model in Figure 1.2,
however, the shared memory is physically distributed among all processors (as local
memories). The collection of all local memories forms a global address space accessible
to all processors. The time to access a remote memory bank is longer than the time to
access a local one (labeled as LM), because a processor has to go through the
interconnection network when accessing the former. The cache-only-memory-access
(COMA) model [28] is a special case of a NUMA machine, in which the distributed main
memories are converted to caches.
Figure 1.1 The UMA multiprocessor model
Figure 1.2 The NUMA multiprocessor model
Multiprocessor systems are suitable for general-purpose applications where
programmability is the major concern. They preserve the intuitively appealing
programming model provided by a single, linear memory address space. It is the opinion
of many researchers [32][33][34] in the area of parallel processing that the recent failures
of many parallel computing companies are partly due to the difficulty of programming
their supercomputers and the lack of software for them. Mainly because of the programming
complexities of message-passing multicomputers, the architectural trend for future
general-purpose computers is in favor of shared-memory systems [32][33][37]. Another
significant factor in the observed failures is the price/performance ratio. In fact, Ted
Lewis states in [35] the following: "I could never understand why a 30-processor
multicomputer costs as much as 130 workstations when both contain the same
commodity chips." It should also be mentioned that the price/performance ratio is always a
major factor in a successful uniprocessor or multiprocessor system design.
Latency tolerance for memory access is a major limitation in shared-memory
systems [3][4][13] due to the bandwidth constraint. For instance, four to eight modern
processors (with private caches, with or without prefetching) accessing the same memory
module can easily use up all the available bandwidth [5][37].
1.1.1.1 Caches: Private caches in conjunction with hardware-based cache coherence
maintenance [36] have contributed to the reduction of the ill effects of large memory
access times. Caches are placed between the (fast) processor and the (slow) main (shared)
memory, and their basic function is to hold regions of recently referenced shared
memory. References satisfied by the cache (called cache hits) proceed at processor speed;
those unsatisfied (called misses) incur a cache miss penalty to fetch the corresponding
data from the shared memory. Most modern processors must wait, or stall, until the data
arrive. Caches increase system performance not only because data is available to the
processor faster but also because they reduce congestion on the interconnection network,
i.e., a high cache hit rate eliminates some of the traffic that would otherwise have gone
out across the interconnection network to the shared memory. Reduced traffic increases
the number of processors that can be effectively supported on the interconnection
network. Caches explain why multiprocessor systems are designed with less
memory bandwidth than the sum of the bandwidths required by the individual
CPUs. When planning the design of a multiprocessor system, great consideration should
be given to what kind of applications will be executed on it. The maximum speedup of an
application is bounded by the speed of the interconnection network.
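The relationship between cache hit rate, interconnect traffic, and the number of processors a shared interconnect can sustain can be illustrated with a rough back-of-the-envelope model. The sketch below is not taken from the dissertation: the function names, the assumption that every miss moves exactly one fixed-size block, and the numeric values in the usage note are all hypothetical, and the model ignores arbitration, coherence traffic, and write-backs.

```python
def bus_traffic_per_processor(refs_per_sec, miss_rate, block_bytes):
    """Bytes per second that one processor places on the shared interconnect.

    Assumption (illustrative): only cache misses reach the interconnect, and
    each miss transfers exactly one block of `block_bytes` bytes.
    """
    return refs_per_sec * miss_rate * block_bytes


def max_supported_processors(bus_bytes_per_sec, refs_per_sec, miss_rate, block_bytes):
    """Largest processor count whose combined miss traffic fits the bandwidth."""
    traffic = bus_traffic_per_processor(refs_per_sec, miss_rate, block_bytes)
    if traffic == 0:
        return float("inf")  # a perfect cache never touches the interconnect
    return int(bus_bytes_per_sec // traffic)
```

With an assumed 400 Mbytes/s interconnect, 40 million references per second per processor, and 16-byte blocks, a 10% miss rate gives `max_supported_processors(400e6, 40e6, 0.10, 16)` equal to 6, while halving the miss rate to 5% raises the result to 12. This is the sense in which a higher hit rate increases the number of processors that can be effectively supported.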
1.1.1.2 Dynamic Interconnection Networks: Multiple processors in shared-memory
systems communicate and access the common memory through a dynamic
interconnection network. This network implements all communication patterns based on
program demands using switches and arbiters along the connecting paths to provide the
dynamic connectivity. In increasing order of cost and performance, dynamic connection
networks include bus systems, multistage interconnection networks (MIN), and crossbar
switch networks. The performance is indicated by the network bandwidth, data transfer
rate, network latency, and communication patterns supported.
The simplest and least costly way to construct a multiprocessor system is to
connect the processors on a shared bus. In the past, commercial releases of bus-based
multiprocessors supported as many as 32 processors. The advent of high-performance
ultra-fast processors has reduced that number to four [11] or eight [7]. For instance, each
node of the Stanford DASH multiprocessor [14] is a bus-based cluster and supports only
four high-performance RISC processors. The amount of data that a shared bus can
deliver to a computer system depends on its speed (clock rate), memory access time, and
data-bus width. Conventional designs of shared-bus shared-memory systems naturally set
the width of the bus equal to the data bus width of the processor. The shared
bus can handle only one transaction at a time, from a single source, which
limits the total amount of data transferred per transaction to the data bus width of the
processor. The proposed twin-prefetching system separates the traditional linkage of
the shared bus and the processor bus and enables the use of a wider shared bus. A wider
shared bus adds bandwidth to the system and supports a greater number of
processors. It makes the system scalable (unheard of for shared-bus shared-memory
systems) since the width of the shared bus becomes independent of the processor bus
width, i.e., bandwidth is increased by increasing the shared bus width.
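To make the two ideas above concrete, the following sketch estimates the peak bandwidth of a shared bus as a function of its width, and compares a single-buffer node with a twin-buffer node in which one memory is filled from the shared bus while the processor works out of the other. It is a simplified illustration rather than the dissertation's simulator: the function names, the one-transfer-per-clock assumption, and the numbers in the usage note are hypothetical, and bus contention among nodes and result write-back are ignored.

```python
def peak_bus_bandwidth(width_bits, clock_hz):
    """Peak bytes per second, assuming one full-width transfer per bus clock."""
    return (width_bits // 8) * clock_hz


def single_buffer_time(n_segments, compute_time, transfer_time):
    """One buffer: each segment is fetched, then processed, with no overlap."""
    return n_segments * (transfer_time + compute_time)


def twin_buffer_time(n_segments, compute_time, transfer_time):
    """Two buffers used alternately (ping-pong).

    The first fetch cannot be hidden; afterwards, fetching segment i+1
    overlaps processing segment i, so each step costs the longer of the two.
    The last segment is processed with nothing left to fetch.
    """
    if n_segments == 0:
        return 0
    overlapped_steps = (n_segments - 1) * max(compute_time, transfer_time)
    return transfer_time + overlapped_steps + compute_time
```

For example, with 4 segments, a compute time of 3 units, and a transfer time of 2 units per segment, the single-buffer estimate is 20 units and the twin-buffer estimate is 14 units. Under the same assumptions, widening a bus clocked at an assumed 40 MHz from 32 to 128 bits raises the peak bandwidth from 160 to 640 Mbytes/s, which shortens the transfer time per segment and leaves more of the bus free for other nodes.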
The crossbar and multistage networks are more complex and provide higher
bandwidth at higher cost. They are to be used if the target performance cannot be achieved
through a bus interconnect [10]. Digital signal processor (DSP) based parallel computers
built for image processing [21][25][27] have had no choice but to employ crossbar and
multistage switches in order to cope with the immense processing load.
1.1.1.3 Prefetching: Recent research has shown that prefetching in caches further
reduces memory latencies and increases system performance [1][2][3][4][8][9][30]. The
goal of any prefetching scheme is to reduce the processor stall time by bringing data into
the cache before they are referenced. Prefetching approaches proposed in the literature are
software or hardware based. Software-controlled prefetching schemes rely on the
programmer/compiler to insert prefetch instructions prior to the instructions that trigger a
miss. Hardware-controlled prefetching schemes detect accesses with regular patterns and
issue prefetches at run time. T. Mowry and A. Gupta [3] (software-controlled
prefetching), for example, report as much as a 150% performance improvement when
applications with regular data access patterns are executed on the Stanford DASH
multiprocessor [14]. Fredrik et al. [4] report a 78% reduction in read misses by employing
a simple hardware-controlled prefetching technique that relies on an automatic prefetch
of multiple consecutive blocks that follow the one that caused the miss in the cache. This
technique exploits the spatial locality of data.
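The hardware scheme described above, in which a miss triggers the automatic fetch of the next few consecutive blocks, can be sketched with a toy miss counter. This is only an illustration of the general idea, not the mechanism evaluated in [4]: the function name and the `prefetch_degree` parameter are hypothetical, and the cache is assumed to be unbounded, so only first-touch (cold) misses are counted and no figure from the literature is reproduced.

```python
def count_misses(block_trace, prefetch_degree=0):
    """Count misses for a trace of block numbers.

    On each miss, the missing block and the next `prefetch_degree`
    consecutive blocks are brought into the cache.
    Assumption (illustrative): the cache never evicts anything.
    """
    cached = set()
    misses = 0
    for block in block_trace:
        if block not in cached:
            misses += 1
            cached.add(block)
            # Sequential prefetch: also fetch the blocks that follow the miss.
            for k in range(1, prefetch_degree + 1):
                cached.add(block + k)
    return misses
```

For a purely sequential trace such as `range(16)`, `count_misses(range(16))` returns 16, while `count_misses(range(16), prefetch_degree=3)` returns 4, because each miss also brings in the three blocks that follow it. A trace with little spatial locality would benefit far less, which is why the effectiveness of such a scheme depends on the access pattern of the application.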
1.1.2 Message-Passing Multicomputers
A message-passing multicomputer consists of multiple nodes interconnected by a
point-to-point network. Each node is an autonomous computer including a processor, a private
local memory, and possibly disks or I/O peripherals, as modeled in Figure 1.3. Internode
communication is carried out by passing messages through the network while observing
certain communication protocols. Such actions may involve multiple links (i.e., physical
connections between nodes) and nodes, if the source is not directly connected to the
destination.
Figure 1.3 A multicomputer system
Some common network topologies used in constructing interconnection networks for
multicomputers, shown in Figure 1.4, are the binary tree, star, ring, mesh, and hypercube.
They are also called static connection networks because all links between nodes are fixed
after a network is built.
Figure 1.4 Common topologies for interconnection networks: (a) binary tree, (b) star,
(c) ring, (d) mesh, (e) hypercube
Multicomputers achieve better scalable performance due to their distributed
processor/memory nodes. However, message passing imposes a hardship on
programmers, who must distribute the computations and data sets over the nodes and establish
efficient communication among nodes. Until intelligent compilers and efficient
distributed operating systems become available, multicomputers will continue to lack
programmability. Chandra et al. [6] studied the strengths and weaknesses of the two
fundamental mechanisms of message passing and shared memory by comparing the
performance of equivalent, well-written message-passing and shared-memory programs
running on similar hardware. Each application program was produced in two versions and
its performance was measured on closely related simulators of a message-passing and a
shared-memory machine. They found that three of the four shared-memory programs ran
at roughly the same speed as their message-passing equivalents, even though their
communication patterns were different. Similar results are reported in [12]. Therefore, if
both paradigms achieve the same speedup, it is preferable to choose the shared-memory
approach, which is more user friendly.
1.1.3 Image Processing
Parallel image processing and computer vision have exhibited tremendous growth in the
past decade. This growth has been driven not only by the need for fast processing but
also by the fact that parallelism is well suited to the tasks of digital image processing and
to the nature of digital images. Digital images are sampled on a rectangular grid and are
stored in a two-dimensional array [23]. Therefore, they possess an inherent geometrical
parallelism [24]. This parallelism can be exploited by using a large two-dimensional array
of processors, possibly one per image pixel. However, this is possible only for small
images. Thus, a 512x512 or a 1024x1024 image is segmented (partitioned) into square
blocks or into strips, and each block/strip is assigned to a specific processor. The latter
method allows general-purpose parallel computers to solve problems in digital image
processing.
Parallel architectures can be classified into two large classes: Single Instruction
Multiple Data (SIMD) and Multiple Instruction Multiple Data (MIMD) machines. SIMD
machines dominated parallel digital image processing in the past. Currently, MIMD
computers are progressively taking their place. The main reason is the wide availability
of general-purpose MIMD computers and their use by mainstream scientists and
engineers. Some of these machines (e.g., DSP-based) are attractive because they combine
low cost and high numerical performance [21][22][26]. MIMD machines are further
divided into multiprocessors and multicomputers.
Many digital image processing algorithms are essentially local neighborhood
operations in the form:
where xij, yij are the input and the output image respectively. F is an operator (linear or
nonlinear) and A is its template window. The most frequently used window is a square of
size 3x3. It contains the pixels having city block distance 1,2 from the central pixel.
Neighborhood parallelism denotes the possible parallel execution of local neighborhood
operations. Local neighborhood operations are not usually executed in parallel on
general-purpose parallel computers. They significantly affect the overall speedup of a parallel
system because of the greater locality of data (spatial and temporal) that is observed when
the number of local neighborhood operations is greater. It should be noted that
data locality is a determining factor for the cache or prefetching cache hit ratio. Spatial
locality means that if a location in memory is accessed, then others in the neighborhood
will probably be accessed shortly. Temporal locality means that recently referenced items
are likely to be referenced again in the near future [32].
1.2 Motivation, Objectives, and Contributions
Access to the common memory is the key limiting factor in the performance of
shared-memory multiprocessors. The traversal of the processor-shared-memory interconnect
incurs large latencies as the number of processors increases. The advent of ultra-fast,
heavily pipelined, multifunction modern processors that execute one instruction per cycle
has made the problem worse [19], allowing only a handful of nodes to be
effectively supported on a shared-bus shared-memory multiprocessor system. Moreover,
because computationally intensive digital image processing applications have significant
memory requirements [18], DSP-based multiprocessors can support an even
smaller number of processing units. This limitation forces current commercial DSP-based
shared-memory systems to employ expensive and complex crossbar switch
networks in order to support even as few as four high-performance DSPs
[20][21]. Our vision is to meet the requirements of real-time image processing
application execution with the least expensive and least complex hardware.
The objectives of this dissertation are: (1) to introduce a new shared-bus shared-memory
multiprocessor system architecture that maximizes throughput, minimizes cost,
and has the potential to support real-time high-resolution image processing effectively;
(2) to conduct an extensive investigation of the performance of the proposed
architecture; (3) to propose appropriate values for several system design parameters.
Indeed, the shared-bus shared-memory multiprocessor system proposed in this dissertation
can effectively support 32 DSPs. This is in sharp contrast to
existing DSP-based bus-interconnected systems, which can support only a very small
number of DSPs. This is achieved through the elimination of the traditional direct linkage
of the shared bus and processor data bus, which enables the utilization of a wider shared
bus. Moreover, the fast prefetching cache memories and the wider shared bus provide
additional bus bandwidth that can eliminate large memory latencies and significantly
increase the number of effectively supported processors. It should be noted that the
proposed system can maintain peak performance regardless of image size (small or large).
1.4 Outline
The remainder of this dissertation is organized as follows. Chapter 2 provides a
description of the proposed twin-prefetching DSP-based shared-memory system,
including an overview of the ADSP-21060. Chapter 3 introduces the methodology for the
performance analysis and evaluation of the proposed system. Chapters 4 and 5 present
performance results that provide useful insight into the behavior and efficiency of the
system. Chapter 6 discusses optimal size selection for prefetching caches. Finally,
Chapter 7 presents the conclusions and future research.
CHAPTER 2
TWIN-PREFETCHING DSP-BASED SHARED-MEMORY SYSTEM
The proposed multiprocessor is a high-speed low-cost DSP-based twin-prefetching
shared-memory MIMD parallel system (Figure 2.1). It consists of P nodes, where P is
a power of two. The system is investigated for several values of P such as 1, 2, 4, 8, 16, 32
and 64. At the heart of each node, shown in Figure 2.2, is a DSP processor (ADSP-21060)
optimized for image processing, graphics, speech, sound and other high-speed
numeric processing applications. Several characteristics of this high-performance DSP
processor are presented in Section 2.1. Each node also comprises
two high-speed memories with their controllers, called Twin1 and Twin2. Their
twin-prefetching operation is described in Section 2.2.
2.1 The ADSP-21060
The ADSP-21060 is a Super Harvard Architecture processor (SHARC) [15]. It is the most
versatile and powerful processor offered by Analog Devices, an industry leader in DSP
technology [29]. It has four independent buses for dual data, instructions, and I/O. With
its separate program and data memory buses and on-chip instruction cache, the ADSP-
21060 can fetch two operands and an instruction (from the cache), all in a single cycle.
Figure 2.1 Twin prefetching multiprocessor system diagram
Figure 2.2 Node block diagram
The ADSP-21060 key features are:
• Single-cycle multiply & ALU operations with dual memory read/writes and
instruction fetch
• Super Harvard architecture - four independent buses for dual data, instructions,
and I/O
• Dual ported, for independent access by core processor and DMA, on-chip 4
Mbit SRAM and integrated I/O peripherals
• Integrated multiprocessing features
• Glueless connection for scalable DSP multiprocessing architecture
• Distributed on-chip bus arbitration
• Six link ports for point to point connectivity and array multiprocessing
• 240 Mbytes/s transfer rate over parallel bus
• 240 Mbytes/s transfer rate over link ports
• 120 MFLOPS peak performance
• 40 MIPS, 25 ns instruction rate
• Dual data address generators with modulo and bit reverse addressing
• Efficient program sequencing with zero-overhead looping: single cycle loop
setup
• 32-bit single precision & 40-bit extended-precision IEEE floating-point data
formats
• 32-bit fixed-point data format, integer & fractional, with 80-bit accumulators
• 10 DMA channels
• Background DMA transfers at 40 MHz, in parallel with full-speed
processor execution
• Performs transfers between ADSP-21060 internal memory and external
memory, external peripherals, host processor, serial ports, or link ports
• Three independent computational units
• Arithmetic logic unit (ALU)
• Multiplier/Accumulator (MAC)
• Shifter
• Internal instruction cache
• Provisions for multiprecision computation and saturation logic
• Multifunction instructions
The ADSP-21060's flexible architecture and comprehensive instruction set support a high
degree of parallelism. In one cycle the ADSP-21060 can:
• generate the next program address
• fetch the next instruction
• perform one or two data moves
• update one or two data pointers
• perform a computational operation
2.1.1 Multiprocessing
The ADSP-21060 offers powerful features tailored to multiprocessing DSP systems. The
unified address space allows direct interprocessor accesses of each ADSP-21060's
internal memory. Bus arbitration logic is included on-chip for simple, glueless connection
of systems containing several ADSP-21060s and a host processor. Master processor (bus
master) changeover incurs only one cycle of overhead. Bus arbitration is selectable as
either fixed or rotating priority.
Because glueless connection is limited to 6 processors and the
proposed system is investigated for up to 64 processors, an additional arbitration unit
implementing a rotating-priority algorithm is assumed on the shared bus.
2.1.2 Dual-Ported Internal Memory
The ADSP-21060 contains 4 megabits of on-chip SRAM, organized as two blocks of 2
Mbits each, which can be configured for different combinations of code and data. Each
memory block is dual-ported for single-cycle, independent accesses by the core processor
and I/O processor or DMA controller. Memory can be accessed as 16-bit, 32-bit, or 48-bit
words.
While each memory block can store combinations of code and data, accesses are
more efficient when one block stores data, using the data memory (DM) bus, and the
other block stores instructions and data, using the program memory (PM) bus. Thus, a
dedicated bus to each memory block assures single-cycle instruction execution with two
data transfers. It should be noted that dual data transfer is possible if the instruction is
available in the instruction cache.
Single-cycle instruction execution is also maintained when one of the data
operands is transferred to or from off-chip, via the ADSP-21060's external port. This is
how the proposed twin-prefetching system operates, storing code and some data (filter
coefficients, for example) in the internal memory and retrieving all image data from the
DM data bus through the external port.
2.2 Data Memory
In programming systems, processors are generally considered active resources and
memory is viewed as a passive resource. For the proposed system, data memory
(prefetching cache) functions as both a passive and an active resource: passive because it
supplies the core processor with data, and active because it initiates and completes data
transfers from/to the global (shared) memory.
Data memory comprises two controllers (twin TTCs) and two fast memories
(twin-prefetching caches) placed between the DSP processor and the shared-bus
interconnect. The two TTC/cache pairs are referred to as Twin1 and Twin2. In a typical
operation, one Twin is accessible to the processor, providing data operands, while the other
Twin is transferring data from/to the shared memory.
The elimination of the traditional direct linkage of the shared and processor data
bus enables the utilization of a wider shared bus, i.e., shared bus width becomes
independent of the data bus width of the processors. The fast twin-prefetching memories
and the wider shared bus provide additional bus bandwidth into the system, which
eliminates large memory latencies; such memory latencies constitute the major drawback
for the performance of shared-memory multiprocessors.
2.2.1 The TTCs
The TTC1 and TTC2 controllers are, more specifically, DMA-like devices capable of
two-dimensional addressing. They move rectangular regions of data between global
memory and one of the node's caches. In addition to the DMA capability, the two TTCs
comprise a bi-directional communications port, which allows command, status, and
parameter passing between the VME-bus-connected host and the ADSP-21060.
(The communication capability of the Twins is not necessary, since the ADSP-21060 could connect
directly to the VME bus through its own communication channels; it is required, though, if
we are to place a less versatile processor at the node's core.)
The TTC is accessed and operated through a set of registers. This group of
registers appears in the ADSP-21060 data memory space. To transfer data between global
memory and one of the node's prefetching caches, the ADSP-21060:
• loads the TTC registers with the Start and End addresses of the block of data in
the global memory
• loads the TTC registers with the Start and End addresses of the block of data in
cache
• loads the TTC CONTROL register
• sets the "start transfer" bit in the CONTROL register and the transfer begins
The direction of the transfer depends on the value of a specific bit in the CONTROL register.
Loading and unloading of the Twins occur simultaneously with data processing. In other
words, as soon as a block of data has been moved into one cache, the ADSP-21060 begins
processing it, while the other cache is emptied and then filled with new data. The
processor finishes processing the data in Twin1 and switches to Twin2, which has been filled with
fresh data. The back-and-forth switching between Twin1 and Twin2 allows maximum utilization
of resources, and thus optimum system performance.
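The alternation between the two Twins can be sketched as a simple schedule. This is only an illustration of the ping-pong pattern described above, not the simulator used in this work; the function name `twin_schedule` and the convention that even-numbered segments go to Twin1 are assumptions made for the example.

```python
def twin_schedule(num_segments):
    """Return, for each processing step, which Twin is processed and which transfers.

    Assumed convention: segment k sits in Twin1 when k is even and in Twin2
    when k is odd. While segment k is processed, the other Twin unloads the
    results of the previous segment and loads segment k + 1 (if any).
    """
    steps = []
    for k in range(num_segments):
        active = "Twin1" if k % 2 == 0 else "Twin2"
        idle = "Twin2" if active == "Twin1" else "Twin1"
        next_segment = k + 1 if k + 1 < num_segments else None
        steps.append({"process": (active, k), "transfer": (idle, next_segment)})
    return steps


if __name__ == "__main__":
    for step in twin_schedule(4):
        print(step)
```

Each step pairs the Twin being read by the processor with the Twin whose TTC is using the shared bus, which is the overlap that hides the shared-memory latency.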
For P processing elements, a maximum of 2P TTCs compete for the bus, which is
granted to a TTC by an arbiter implementing rotating priority. Processor $P_i$ is
serviced before processor $P_{i+1}$, and Twin1 is serviced before Twin2. More precisely, if
$Twin_{ij}$ is the jth Twin of the ith processor, the rotation proceeds as follows: $Twin_{11}$,
$Twin_{21}$, ..., $Twin_{P1}$, $Twin_{12}$, $Twin_{22}$, ..., $Twin_{P2}$.
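As a small illustration, the rotation can be enumerated programmatically. The sketch reads the rotation as all first Twins of processors 1 to P followed by all second Twins; the function name `rotation_order` and the (processor, twin) tuple encoding are assumptions of this example.

```python
def rotation_order(P):
    """List (processor, twin) pairs in one full rotation of the arbiter.

    Assumed reading of the text: the first Twin of every processor 1..P is
    visited first, followed by the second Twin of every processor 1..P.
    """
    return [(i, j) for j in (1, 2) for i in range(1, P + 1)]


if __name__ == "__main__":
    print(rotation_order(4))
```

For P processors the list always has 2P entries, matching the maximum number of TTCs that can compete for the bus.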
2.3 Host Processor
The host processor is responsible for booting all nodes and downloading all necessary code
and some data to the internal memory of every ADSP-21060. The data downloaded to the
internal memories include the addresses of the image segments in the global memory that
every node is assigned to process. These addresses are the result of partitioning (a
technique for decomposing a large data set into many small pieces for parallel execution
by multiple processors). They are calculated once for a specific image size and
application and are available for all future requests.
Asynchronous transfers at speeds up to the full clock rate of the processor (ADSP-
21060) are supported. The host interface is accessed through the ADSP-21060's external
port and is memory mapped into the unified address space. Four channels of DMA are
available for the host interface. The host processor requests the ADSP-21060's external
bus with the host bus request (HBR), host bus grant (HBG) and ready (REDY) signals.
The host can directly read or write the internal memory of the ADSP-21060, and can
access the DMA channel set up and mailbox registers.
2.4 Theoretical Analysis
The performance of a computer system, in general, is greatly affected by the amount of
locality present in code or data of an application. Traditional caches and prefetching
caches were invented to take advantage of the temporal, spatial, and sequential locality
present in applications. Spatial locality is the tendency of a process to access items whose
addresses are near one another, i.e., if a location in memory is accessed, then
others nearby will probably be accessed soon. Temporal locality denotes that if an
address in memory is accessed once, then it will probably be accessed again soon. Lastly,
sequential locality indicates the sequential order of most instructions in code.
DSP algorithms are usually encoded in relatively small programs, which occupy
moderate amounts of program memory. Our system can easily store them in the
internal on-chip memory, allowing single-cycle multiple-instruction execution provided that image
data can be retrieved equally fast. The real challenge is data memory. It becomes
prohibitively costly to store several large high-resolution images on expensive fast data
memories (SRAM). Twin prefetching of image segments (along with the use of a wider
shared bus) is a cost effective solution for literally taking advantage of every clock cycle
of every DSP, i.e., we obtain 100% cache hit ratio (if there is enough shared-bus
bandwidth not to stall the prefetching mechanism).
2.4.1 Partitioning Images into Cache Prefetching Segments
Images are stored in the shared memory (inexpensive DRAM). All data are partitioned
into several segments. The segments' addresses could be calculated either by the host or
node processor. For our system the host processor calculates and downloads the addresses
to the nodes. Once calculated, the addresses are stored in the local memory of the host
processor and are available when needed again. A segment should be smaller than the
prefetching cache size for two reasons:
• For the segment to fit in the cache.
• To leave some space for the produced output.
We should note that even if a quite large prefetching cache memory is available, a smaller
segment may result in a faster system execution time. In other words, there should be
enough prefetching cache memory to satisfy the applications that are most demanding
in prefetching cache size, but it is not necessary to utilize all of it.
If P processing elements are available, the image is divided into l x P segments,
i.e., l segments of data are allocated to every node. (Adjacent segments may overlap each
other in certain applications.) TTC1 and TTC2 alternately prefetch the
segments into their respective caches. The node processor switches back and forth,
processing data in Twin1 and Twin2. By constantly retrieving data from fast prefetching
caches, processors maintain single-cycle instruction execution time. Continuous data
retrieval from prefetching caches eliminates shared-bus latencies and maximizes system
performance. Data segments should be weighted equally (contain the same amount of data and take
the same amount of processing time) as much as possible, in order for the system to be
well balanced, a basic condition for obtaining maximum throughput.
2.4.1.1 Example - Two Dimensional Convolution Partitioning Rule: Let P be the
number of processing elements $P_1, P_2, \ldots, P_i, \ldots, P_P$ in the system. Let the input be an $M_r \times N_c$
image, where $M_r$ is the number of pixel rows, numbered $0, 1, \ldots, M_r-1$, and $N_c$ is the number
of pixel columns, numbered $0, 1, \ldots, N_c-1$. Let the template be an $m_r \times n_c$ matrix, where $m_r$ is the
number of rows and $n_c$ is the number of columns of the matrix.
If P processors take part in the convolution, the image is partitioned into P
sections. One way to do that is to divide $M_r$ by P. Assuming $M_r$ is divisible by P, all
sections consist of the same number of rows. For the convolution algorithm, $k+m_r-1$ input
rows are required to produce k output rows. Taking this into consideration, every section
should include an additional $m_r-1$ rows from the following section below. The last
section at the bottom of the image wraps around (to the first section). The total number of
rows in a section becomes $M_r/P+m_r-1$.
Partitioning assigns the ith section to processing element $P_i$. We define $r_i$ as the range
of rows in the ith section. $r_i$ is then described by
$$(i-1)\frac{M_r}{P} \le r_i \le i\frac{M_r}{P} - 1 + (m_r - 1) \qquad (2.1)$$
Every section consists of l segments (smaller divisions of a section which are
prefetched alternately by the Twins). If y output rows are produced for every
prefetched segment, $l = (M_r/P)/y$ (l is usually proportional to the prefetched input and not
the output, but it so happens that in the two-dimensional convolution algorithm the number
of prefetched input rows is equal to $y+m_r-1$).
Let begin_row_pref_addr and end_row_pref_addr be the two row addresses that
include all data in a segment to be prefetched. All addresses of segments in the global
memory are readily determined according to the following simple code:
end_row_pref_addr = begin_row_pref_addr + y + m_r - 2
jump to loop
Column addresses remain constant for all prefetching segments. They receive their values
from the start and end column address of the image in the global memory.
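A minimal sketch of such an address loop is given below. Only the relation end_row_pref_addr = begin_row_pref_addr + y + m_r - 2 is taken from the text; the loop structure, the assumption that each segment begins y rows after the previous one, the modulo wraparound, and the function name `segment_row_addresses` are assumptions made for illustration.

```python
def segment_row_addresses(section_first_row, rows_per_section, y, m_r, M_r):
    """Return (begin_row_pref_addr, end_row_pref_addr) for every segment of a section.

    Assumptions (not from the source): consecutive segments start y rows
    apart, and row addresses beyond the last image row wrap modulo M_r.
    """
    segments = []
    begin = section_first_row
    for _ in range(rows_per_section // y):
        end = begin + y + m_r - 2  # relation stated in the text
        segments.append((begin % M_r, end % M_r))
        begin += y
    return segments


if __name__ == "__main__":
    # Second of two sections of a 1024-row image, y = 4, 3-row template.
    for seg in segment_row_addresses(512, 512, 4, 3, 1024)[-2:]:
        print(seg)
```

Under these assumptions the last segment of the bottom section wraps around to the first rows of the image, in line with the partitioning rule.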
Let us give some specific values to the variables in Equation 2.1. Let $M_r = 1024$. If
P = 2, then i = 1, 2. Substituting i with 1 and then 2 in the equation, we find $0 \le r_1 \le 511+(m_r-1)$
and $512 \le r_2 \le 1023+(m_r-1)$ (wrap around). Ranges $r_1$ and $r_2$ include all segments
assigned to $P_1$ and $P_2$ respectively. If y = 4, then l = 128 and every segment consists of
$4+(m_r-1)$ pixel rows. Likewise, if P = 4, then i = 1, 2, 3, 4, and $0 \le r_1 \le 255+(m_r-1)$,
$256 \le r_2 \le 511+(m_r-1)$, $512 \le r_3 \le 767+(m_r-1)$, and $768 \le r_4 \le 1023+(m_r-1)$ (wrap around). If y = 4,
then l = 64 and every segment consists of $4+(m_r-1)$ pixel rows.
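The arithmetic of this example can be checked with a short script. The function names are hypothetical, and the template height m_r = 3 used in the demonstration is an arbitrary choice for illustration, since the example leaves m_r symbolic.

```python
def section_row_range(i, P, M_r, m_r):
    """Inclusive row range of section i (1-based), before wraparound is applied."""
    rows = M_r // P
    first = (i - 1) * rows
    last = i * rows - 1 + (m_r - 1)
    return first, last


def segments_per_section(P, M_r, y):
    """Number of segments l in a section when each segment yields y output rows."""
    return (M_r // P) // y


if __name__ == "__main__":
    M_r, m_r, y = 1024, 3, 4  # m_r = 3 is an assumed value for the demo
    for P in (2, 4):
        ranges = [section_row_range(i, P, M_r, m_r) for i in range(1, P + 1)]
        print(P, ranges, segments_per_section(P, M_r, y))
```

A last row greater than M_r - 1 in the final section indicates rows that wrap around to the top of the image.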
2.4.2 Categories of Applications
From this point on we will refer to tw as the time required by the processor to process a
segment of data, tUL as the sum of tL (time required by a TTC to load a segment into the
prefetching cache) and tU (time required to unload the results produced from processing
the previously loaded segment), and Twratio as the ratio tw/tUL. It should be noted that in
this section arbitration delays and processor switching from one Twin to
the other are considered negligible, as are the few clock cycles
that the processor spends transferring to a Twin the address of the next segment.
The simulation, though, takes all such details of the multiprocessor
system into consideration.
Different applications require different tw to process the same amount of data. The
larger the tw, the more bus bandwidth is available to serve a greater number of nodes.
(During time tw the Twin does not hold or request the shared bus.) The Twratio (tw/tUL) is
extremely important for the twin-prefetching multiprocessor system. If Twratio's value is
equal to P (Eq. 2.2), the twin-prefetching multiprocessor system reaches maximum
utilization of its resources (i.e., bus and processor bandwidth). Twratios greater than P
move the system further away from bus contention. Twratios less than P, on the other
hand, induce a bottleneck and limit the multiprocessor's throughput.
$$\frac{t_w}{t_{UL}} = P \qquad (2.2)$$
The boundary condition (tw/tUL = P) serves as a basis for understanding the
operation and capabilities of the twin-prefetching multiprocessor system. It can be
proven simply as follows. For P processing elements there are 2P Twins in the system
sharing the common bus with rotating priority. Let Trot be the total time for all Twins to
unload results and load a new segment of image data. By definition, at the boundary
condition there is just enough bus bandwidth for all Twins to have their segments ready
for processing by their core processors. Therefore Trot is given by
$$T_{rot} = 2P \times t_{UL} \qquad (2.3)$$
During time Trot a processor has just enough time to process the data in both Twins, i.e., it
finishes processing Twin1 and immediately switches to Twin2 without having to wait for
any data transfer to finish. Therefore Trot is also given by
$$T_{rot} = 2 \times t_w \qquad (2.4)$$
From Equations 2.3 and 2.4 we derive the boundary Equation 2.2.
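Hence every application falls into one of three possible cases:
I. Twratio = P	 maximum utilization of bus and processor bandwidth
II. Twratio > P	 unused shared-bus bandwidth available
III. Twratio < P	 bottleneck - unused processor bandwidth
Cases one and two are preferable since a bottleneck is undesirable in all multiprocessor
systems.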
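The comparison of Twratio with P can be expressed in a few lines. This is a hedged sketch: the function name `classify_application`, the returned labels, and the use of a floating-point tolerance for the boundary case are choices made for illustration only.

```python
import math


def classify_application(t_w, t_ul, P):
    """Classify an application by comparing Twratio = t_w / t_ul with P.

    Returns "I" at the boundary (Twratio = P), "II" when Twratio > P
    (spare bus bandwidth) and "III" when Twratio < P (bus bottleneck).
    """
    tw_ratio = t_w / t_ul
    if math.isclose(tw_ratio, P):
        return "I"
    return "II" if tw_ratio > P else "III"


if __name__ == "__main__":
    # Illustrative numbers only: t_w and t_ul in arbitrary time units.
    for t_w in (4.0, 8.0, 16.0):
        print(t_w, classify_application(t_w, 1.0, 8))
```

Read the other way around, for a given application the largest P that avoids a bottleneck under this idealized model is the integer part of tw/tUL.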
The twin-prefetching multiprocessor system is expected to compare favorably with
other existing shared-memory multiprocessor systems because the "twin" architecture
allows cache-only data retrieval. An invaluable advantage over other systems is the
elimination of the traditional direct linkage of the shared bus and processor data bus,
which makes the utilization of a wider shared bus feasible.
2.5 Simulation
The simulation model consists of the ADSP-21060, two controllers (twin TTCs), two fast
memory modules, a shared memory and a shared-bus interconnect. The most difficult part
of the simulation was deciding where to place the timers. Usually the timing units are
the processing units themselves. In our system the processing units do not communicate
directly with the shared bus; they communicate through TTCs. Therefore, timers were
placed on the TTCs, which monitor both shared-bus and processor activity. If there is
adequate bandwidth on the bus (for image segments to be loaded on time and to be ready
in the prefetching caches for processing), then the timer of TTC1 starts where the timer of TTC2
stopped and vice versa. If sufficient bandwidth is not available, then delays are
encountered on the shared bus, which have to be considered. Accessing the cache takes 25
ns (the same as the instruction rate) while accessing shared memory takes 75 ns.
Arbitration costs 4 clock cycles, processor switching from one Twin to the other costs 1
clock cycle, and informing the TTC about the next unload and load costs 20 cycles. All
instructions are executed in one clock cycle. Multifunction instructions with dual data
fetches are executed in one clock cycle only if they are available in the instruction cache.
Hence all multifunction instructions which implement the convolution algorithm
(Appendix A) execute in two clock cycles the first time the loop is entered; in any
subsequent pass through the loop they are executed in one clock cycle.
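The timing figures quoted above can be collected into a small cost model. The sketch below is only an illustration: the constants come from the text, but the assumptions that each 75 ns shared-memory access moves exactly one bus-wide word and that one arbitration precedes each transfer, as well as the helper name `transfer_time_ns`, are not taken from the simulator itself.

```python
CYCLE_NS = 25            # instruction rate and cache access time (from the text)
SHARED_ACCESS_NS = 75    # shared-memory access time (from the text)
ARBITRATION_CYCLES = 4   # bus arbitration cost (from the text)
SWITCH_CYCLES = 1        # switching from one Twin to the other (from the text)
TTC_SETUP_CYCLES = 20    # informing the TTC about the next unload/load (from the text)


def transfer_time_ns(num_bytes, bus_width_bits):
    """Estimate the time for one TTC transfer over the shared bus.

    Assumed model: one arbitration, then one shared-memory access per
    bus-wide word; a partially filled last word still costs a full access.
    """
    bytes_per_access = bus_width_bits // 8
    accesses = -(-num_bytes // bytes_per_access)  # ceiling division
    return ARBITRATION_CYCLES * CYCLE_NS + accesses * SHARED_ACCESS_NS


if __name__ == "__main__":
    for width in (32, 64, 128, 256, 512):
        print(width, transfer_time_ns(1024, width))
```

Under this model the transfer time falls almost in proportion to the bus width, which is the effect the wider shared bus is intended to exploit.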
CHAPTER 3
PERFORMANCE ANALYSIS
The DSP-based twin-prefetching shared-memory multiprocessor is investigated for P
processing elements, several shared-bus interconnect widths (n1) and several prefetching
cache sizes (csz). P takes values such as 1, 2, 4, 8, 16, 32 and 64, n1 takes values
such as 32, 64, 128, 256 and 512, and csz takes 10 different values between a few
Kbytes and 4 Mbytes. Every possible combination of P, n1, and csz defines a
multiprocessor system with different capabilities. In total there are 350 system
configurations. We specifically investigate the following:
• which configurations of twin prefetching multiprocessor system perform better
• how many nodes could be effectively supported by the shared-bus
interconnect
• the optimal prefetching cache size
• the effect of changing the shared-bus-width (n1) on the performance of twin-
prefetching multiprocessor system
• the effect of the wide variance of temporal and spatial locality (present in
different image processing applications) on the performance of twin-
prefetching multiprocessor systems
• the effect of the wide variance of temporal and spatial locality of applications
on the prefetching cache size
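The count of 350 configurations follows from the seven values of P, the five bus widths and the ten cache sizes. The short sketch below enumerates them; because the individual cache sizes are not listed at this point, they are represented by placeholder indices 0 to 9, which is an assumption of the example.

```python
from itertools import product

P_VALUES = [1, 2, 4, 8, 16, 32, 64]     # numbers of processing elements
BUS_WIDTHS = [32, 64, 128, 256, 512]    # shared-bus widths
CSZ_INDICES = range(10)                 # placeholders for the 10 cache sizes

# Every (P, bus width, cache size) combination defines one system configuration.
configurations = list(product(P_VALUES, BUS_WIDTHS, CSZ_INDICES))

if __name__ == "__main__":
    print(len(configurations))
```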
Timing the execution of several image-processing applications would have been a
straightforward but inaccurate way to investigate all configurations of the twin-prefetching
system. One reason is that the amounts of spatial and temporal locality (factors that
significantly affect Twratio) are not readily observed in different image-processing
applications; thus results would be quite misleading if the selected applications did not
cover a wide range of Twratios. Furthermore, if the selection of applications for the
investigation of one system is usually a subject of debate among researchers, one can
imagine the degree of disagreement when a number of systems are involved. Another
reason is that different applications may require different partitioning algorithms, which
possibly divide the image into significantly dissimilar segments, thus making a
fair performance comparison of systems infeasible. It would be impossible, for example, to
determine whether the source of a significant change in the value of the optimal
prefetching cache size is the application itself or some operational detail, unanticipated
before simulation, affecting twin-prefetching.
3.1 Methodology for Performance Evaluation
To overcome the above problems, a fair methodology is devised to
investigate and evaluate the performance of the 350 different configurations of the
twin-prefetching shared-memory system when executing computationally intensive
high-resolution image-processing applications. This methodology is based on first finding the
main characteristics of image-processing applications when implemented on the
twin-prefetching system. Afterwards we apply low-end and high-end values of these characteristics,
aiming to cover all possible cases.
The factor that captures all characteristics of an application executed on the
proposed system is Twratio (tw/tUL). According to the value of Twratio, a particular
application falls into one of three cases named Case I, Case II, and Case III:
I. Twratio = P maximum utilization of bus and processor bandwidth
II. Twratio > P	 unused shared-bus bandwidth available
III. Twratio <P	 bottleneck - unused processor bandwidth
All cases are also discussed in Chapter 2 where it is established that Twratio is the
determining factor for the performance of an application on any configuration of the twin-
prefetching multiprocessor. It inherently combines all the fundamental software and
hardware parameters. Some of the parameters are: amount of spatial and temporal locality
involved with the application (a major contributor to tw), speed of DSPs, access time of
both shared-memory and prefetching cache memories, partitioning algorithm of input
images (prefetching segment size), produced output segment size, shared-bus-
interconnect-width, communication overhead, P, and the shared-bus bandwidth. If
hardware parameters are set to specific values, then Twratio depends only on application
parameters like:
• tw
• input segment size & tL
• produced output size & tU
• communication overhead
(It should be noted that the above parameters are both hardware and software dependent.
They are referred to as application dependent when hardware parameters receive constant
values.)
tw (the time required by the processor to process a segment of data) is proportional
to the spatial and temporal locality of the application, i.e., the more times the memory addresses
of data in the prefetched segment are accessed, the larger the value of tw. In other words,
tw is proportional to the (average) number of times that every item in the prefetched
segment is referenced. tL is dependent on the size of the input segment. If the segment size is
increased, tL and tw are also increased. tU is also generally increased. (In very rare cases
there is very little output, which is transferred to the shared memory at the last unload.)
Regardless, though, of how much each parameter value increases or decreases, what
really matters is Twratio (tw/tUL).
We investigate the effect of different tL and tU by varying the prefetching
segment size. Moreover, varying the size of the input segment in the prefetching cache
(taking into account the change of output size too), we evaluate, also, the performance of
different sizes of prefetching cache. It should be noted that there are obvious and not so
apparent factors determining the optimal size of prefetching cache. Obvious and
predictable factors are: the size of input segment and the size of produced output. The less
obvious factors, which might affect the optimal prefetching cache size and the operation
of the system in general, are the various delays due to initial loading of segments
(transient performance), amount of time spent by the processor to instruct the Twins
about the next loading and unloading, amount of time spent by the processor to switch
from one Twin to the other, and possible unknown effects of the twin-prefetching
mechanism.
Finally, communication overhead (h) should be a small fraction of the workload
(w) in a parallel algorithm in order for the algorithm to be efficiently mapped onto a
parallel computer. The smaller the communication overhead, the closer the efficiency (E)
is to 1. Generally the efficiency of a parallel algorithm is defined as
$$E = \frac{w}{w + h}$$
We intend to show that the twin-prefetching system can handle a large number of
communication operations with little overhead, owing to the twin-prefetching mechanism
and the additional bandwidth provided by a wider shared bus.
We could summarize what was said so far in this chapter with two contradictory
statements.
1. We need the characteristics of several applications to investigate and evaluate
all 350 hardware configurations of the twin-prefetching system.
2. Timing several different applications would not support a fair comparison of
all twin-prefetching systems.
All the above challenges are met by selecting an application, which behaves like many
applications if its attributes are judiciously varied.
3.2 Two-Dimensional Convolution
The selected application is two-dimensional convolution, which serves as the
fundamental operation for a wide variety of other image-processing and computer vision
applications, e.g., edge detection, object detection, image smoothing [16], edge
enhancement. Because of the fundamental nature of this application problem and because
of its high computing complexity on a single processor system, much attention has been
devoted to the development of efficient parallel architectures and algorithms for its
implementation [38][39][40][41]. The inputs to the two-dimensional convolution are an
Mr x Nc image matrix I[0...(Mr-1)][0...(Nc-1)] and an mr x nc template matrix
T[0...(mr-1)][0...(nc-1)]. The output image is an Mr x Nc matrix C2D where
\[ C_{2D}[i][j] = \sum_{k=0}^{m_r-1} \sum_{l=0}^{n_c-1} I[(i+k) \bmod M_r][(j+l) \bmod N_c] \, T[k][l] \]
for 0 ≤ i ≤ Mr-1 and 0 ≤ j ≤ Nc-1.
C2D is called the two-dimensional convolution of I and T. When implementing
two-dimensional convolution on the twin-prefetching multiprocessor, the template matrix is
allowed to extend outside the input window (wraparound); otherwise, the output matrix is
smaller than the input matrix by mr-1 rows and nc-1 columns.
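As an illustration of the wraparound behavior, the following pure-Python sketch computes a two-dimensional convolution in which template positions that fall outside the image wrap around to the opposite edge. It assumes the common form in which output pixel (i, j) accumulates I[(i+k) mod Mr][(j+l) mod Nc]·T[k][l] over all template coefficients; the function name `convolve2d_wraparound` is hypothetical, and the code is not the DSP implementation used in this work.

```python
def convolve2d_wraparound(image, template):
    """Two-dimensional convolution with wraparound at the image borders.

    image:    list of Mr rows, each with Nc pixel values
    template: list of mr rows, each with nc coefficients
    Returns an Mr x Nc output, the same size as the input image.
    """
    M_r, N_c = len(image), len(image[0])
    m_r, n_c = len(template), len(template[0])
    out = [[0] * N_c for _ in range(M_r)]
    for i in range(M_r):
        for j in range(N_c):
            acc = 0
            for k in range(m_r):
                for l in range(n_c):
                    # The modulo makes the template wrap around the image edges.
                    acc += image[(i + k) % M_r][(j + l) % N_c] * template[k][l]
            out[i][j] = acc
    return out


# Small usage example: 4x4 image holding 0..15 and a 2x2 template of ones.
image = [[4 * r + c for c in range(4)] for r in range(4)]
template = [[1, 1], [1, 1]]
result = convolve2d_wraparound(image, template)
print(result[0][0])  # 0 + 1 + 4 + 5 = 10
print(result[3][3])  # 15 + 12 + 3 + 0 = 30 (wraps in both directions)
```

Because of the wraparound, the output has exactly as many rows and columns as the input, whatever the template size.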
Input image matrix I is partitioned among processors according to the rule
described in section 2.4.1.1, assuming that Mr and Nc are powers of 2.
Let us examine, in more detail, how convolution serves as the foundation for the
fair investigation and evaluation of all 350 hardware configurations of the twin-prefetching
system, i.e., how the parameters of the convolution problem receive values such that they
cover a wide variety of applications.
3.3 Application Parameters
Different Twratios (tw/(tL+tU)) are produced by varying the size of the template matrix.
By increasing the template size we increase tw (the processing time of a segment) while
keeping the size of the input segment (and consequently tL) virtually the same. The spatial
and temporal locality of the application (we refer to the locality of the application's data,
not its code) is also increased. The measure of locality is the number of times every segment
pixel is referenced in the prefetching caches. Typically every datum in caches or prefetching
caches is referenced from a few to about 30 times. Template matrix sizes of 2x2, 3x3,
6x6, and 9x9 are utilized, yielding localities of 4, 9, 36, and 81, respectively, thus
covering a wide range of data localities and Twratios.
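The locality values quoted above follow directly from the number of coefficients in each template matrix. The short Python sketch below reproduces them; the `relative_work` list additionally assumes, for illustration only, that the processing time of a segment grows in proportion to the number of multiply-accumulate operations per pixel.

```python
TEMPLATE_SIZES = [(2, 2), (3, 3), (6, 6), (9, 9)]


def locality(m_r, n_c):
    """Number of template coefficients, i.e. references per segment pixel."""
    return m_r * n_c


localities = [locality(m, n) for m, n in TEMPLATE_SIZES]
print(localities)  # [4, 9, 36, 81]

# Per-pixel work relative to the 2x2 template (assumption: work scales with
# the number of multiply-accumulate operations per output pixel).
relative_work = [k / localities[0] for k in localities]
print(relative_work)  # [1.0, 2.25, 9.0, 20.25]
```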
The effect of prefetching cache size on performance is investigated by keeping
all other hardware and software parameters constant and
varying the size of the input segment in the prefetching cache from a few Kbytes to 4 Mbytes.
Ten different sizes of input image segment cover a wide range of tL, tU, and prefetching
cache size.
Before explaining the way we introduce communication overhead to the system,
let us recall that multicomputers (distributed memory systems) communicate
through message-passing among nodes, while multiprocessors (shared memory systems)
communicate through the shared memory. In the case of convolution, if α
output matrix rows are to be produced, α+mr-1 input rows have to be present. Let us assume
that the image is partitioned by dividing the number of rows by the number of processing
elements. In a multicomputer system every processor would have to transfer
(communicate) the mr-1 rows from other nodes in order to produce the α output rows. In
a bus-connected shared-memory system every node retrieves mr-1 extra rows (besides the
equal share obtained by dividing all rows by P) in order to produce the α output rows. In
general, when solving convolution using a parallel computer rather than a uniprocessor
system, P(mr-1) extra rows of data have to be communicated. The ratio (mr-1)/α is a direct
measure of the communication overhead involved, and its value is traditionally expected
to be small (much smaller than 1).
We test the twin-prefetching shared-memory system under severe communication
overhead by allowing the ratio (mr-1)/α to take values much larger than one. For example,
if α=1 (in a prefetched segment) and template matrices of 2x2, 3x3, 6x6, and 9x9 are used,
the value of mr-1 becomes 1, 2, 5, and 8, respectively. The twin-prefetching shared-memory
system is also tested under lighter communication overhead. We introduce extreme ratios like
1/1, 2/1, 5/1, and 8/1 in order to investigate the performance of the proposed architecture
with applications involving a large amount of communication overhead. It is important to
note that bus-connected shared-memory systems are extremely sensitive to the amount of
communication overhead. This is due to the fact that all system activity should "fit in
only one path." For this reason, DSP-based bus-connected architectures support only up to
4 processors. We expect the proposed architecture to effectively support more than four
due to the twin-prefetching mechanism and the wider shared-bus.
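The communication overhead ratio (mr-1)/α discussed above is easy to tabulate. The Python sketch below is a minimal illustration with hypothetical function names. The even partition of 1024 rows over 8 nodes is an assumed example, while the α=1 case reproduces the extreme ratios quoted in the text.

```python
def rows_needed(alpha, m_r):
    """Input rows that must be present to produce alpha output rows."""
    return alpha + m_r - 1


def overhead_ratio(alpha, m_r):
    """Communication overhead measure (mr - 1) / alpha."""
    return (m_r - 1) / alpha


def extra_rows_total(num_processors, m_r):
    """Extra rows communicated across all P nodes: P * (mr - 1)."""
    return num_processors * (m_r - 1)


# Assumed example: 1024 image rows split evenly over 8 nodes, 9x9 template.
alpha = 1024 // 8
print(rows_needed(alpha, 9))     # 136
print(overhead_ratio(alpha, 9))  # 0.0625
print(extra_rows_total(8, 9))    # 64

# Severe case from the text: one output row per prefetched segment (alpha = 1).
print([overhead_ratio(1, m_r) for m_r in (2, 3, 6, 9)])  # [1.0, 2.0, 5.0, 8.0]
```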
The convolution is performed on eight consecutive image frames stored in the
shared memory. The resolution of every image is 1024x1024. We choose to test the
proposed architecture with large, high-resolution images because we want to show that
this architecture can sustain maximum performance in challenging processing conditions.
Existing systems claim real-time execution on small (64x64 or 128x128) images, which
fit in small local or internal SRAM memories. If they are to perform operations on larger
images, their performance degrades dramatically. The twin-prefetching architecture does
not have these limitations.
The execution time of an application on the proposed system is measured up to the point
at which all output results have been transferred to the shared memory. This is in contrast
to some investigations of distributed or shared-memory systems, which present impressive but
misleading results by measuring execution time while excluding the time to load data into
and unload results from local memories.
CHAPTER 4
EFFECT OF SHARED-BUS-WIDTH ON PERFORMANCE
Chapter 4 investigates the effect of shared-bus-interconnect-width on the performance of
several configurations of the proposed P-node DSP-based twin-prefetching shared-
memory multiprocessor systems. P receives values 1, 2, 4, 8, 16, 32 and 64 and shared-
bus-width (nl) receives values 32, 64, 128, 256, and 512. Tables and figures in this
chapter demonstrate how the performance of a P-node twin-prefetching multiprocessor
system changes as shared-bus-width changes. The number of nodes (processors) which
can be effectively supported for a specific value of shared-bus-width is also investigated.
The prefetching cache size was kept between 262 Kbytes and 290 Kbytes. The selection
of prefetching cache size is analyzed in Chapter 6. Not too much attention should be paid
to the specific prefetching cache size, since in most cases prefetching cache sizes ranging
from ~100 Kbytes to ~300 Kbytes yield performances that differ by no more than 2%.
Four applications (convolution utilizing four different template matrices) were
executed for all multiprocessor system configurations. Timings (T) are based on the
convolution of eight consecutive high-resolution images (1024x1024) stored in shared
memory. Pixel size is 16 bits.
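The size of this workload can be checked with a few lines of arithmetic. The Python sketch below assumes that the shared-bus-width nl is expressed in bits (data lines) and counts only the minimum number of bus transfers needed to move every input pixel once. It ignores the extra rows fetched at segment boundaries and all output traffic, so it is a lower bound for illustration rather than a model of the simulated system.

```python
FRAMES = 8
ROWS = COLS = 1024
BITS_PER_PIXEL = 16

TOTAL_PIXELS = FRAMES * ROWS * COLS
TOTAL_BYTES = TOTAL_PIXELS * BITS_PER_PIXEL // 8


def min_input_transfers(bus_width_bits):
    """Lower bound on bus transfers needed to move all input pixels once
    (assumption: every transfer carries bus_width_bits of pixel data)."""
    pixels_per_transfer = bus_width_bits // BITS_PER_PIXEL
    return TOTAL_PIXELS // pixels_per_transfer


print(TOTAL_BYTES)  # 16777216 bytes (16 MiB) of input pixels
for nl in (32, 64, 128, 256, 512):
    print(nl, nl // BITS_PER_PIXEL, min_input_transfers(nl))
```

Doubling the bus width halves this lower bound, which is the additional bandwidth whose effect the chapter measures.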
Every application demands a different amount of shared-bus bandwidth. In Sections
4.1, 4.2, 4.3, and 4.4 we investigate the performance of several P-node systems for
several values of nl when the template matrix is 2x2, 3x3, 6x6, and 9x9, respectively. The
reason for focusing on the timings of individual applications (one by one) is to ensure that
the observed changes in the availability of shared-bus bandwidth are due only to the change
of the shared-bus width, not to the application. By the same criterion, a fair comparison of
all configurations of the twin-prefetching multiprocessor system is also ensured. Finally, a
brief discussion of all applications is provided in Section 4.5.
Speedup factor S(P) indicates how much faster a P-node system executes an
application compared to a uniprocessor system (nl is kept the same in both cases). S(P) is
given by
\[ S(P) = \frac{T(1)}{T(P)} \]
where T(1) is the time required to execute a program on a uniprocessor while T(P) is the
time required to execute the same program on a P-node (processor) system. We define the
bus-width-speedup-factor S(nl) as the ratio of T(32) and T(nl). T(32) is the time required
to execute a program on a P-node system with shared-bus width equal to 32, and T(nl) is
the time required to execute a program on the same P-node system with shared-bus width
equal to nl. Thus S(nl) is given by
\[ S(n_l) = \frac{T(32)}{T(n_l)} \]
System efficiency, E(P), of a P-node multiprocessor is given by
\[ E(P) = \frac{S(P)}{P} \]
The best possible system efficiency is 1, i.e., the best speedup factor is linear, or S(P)=P.
Tables labeled "Time vs. Number of Processors" consist of six columns. The first
column contains P values, while the other five columns contain timings of the P-node
system when nl is 32, 64, 128, 256, and 512. The corresponding speedup, S(P), and
efficiency, E(P), of every entry are given in tables labeled "Speedup vs. Number of
Processors" and "Efficiency vs. Number of Processors," respectively.
Tables labeled "Speedup vs. Shared-bus-width" consist of eight columns. The first
column contains shared-bus-width (nl) values, while the other seven columns contain the
speedup factor of a specific P-node system for every value of nl. SP1 in these tables
stands for S(1), SP2 stands for S(2), and so on. Appendix B contains tables and figures
describing "Execution Time vs. Shared-bus-width," and the corresponding tables and
figures for "Speedup vs. Shared-bus-width."
An effective system is considered to be a system with a speedup factor greater
than P/2, i.e., providing system efficiency greater than 0.5. Shaded areas within tables
labeled "Time or speedup or efficiency vs. Number of Processors," point out effective
systems.
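The definitions above translate directly into code. The Python sketch below is a minimal illustration with hypothetical function names. It assumes the usual forms S(P)=T(1)/T(P) and E(P)=S(P)/P, and it applies the effectiveness criterion to two speedup values quoted later in the text for P=16 and nl=32 (15.43 for the 9x9 template and 2.31 for the 2x2 template).

```python
def speedup(t_1, t_p):
    """Speedup factor S(P) = T(1) / T(P)."""
    return t_1 / t_p


def efficiency(s_p, p):
    """System efficiency E(P) = S(P) / P."""
    return s_p / p


def is_effective(s_p, p):
    """A system is considered effective when S(P) > P/2, i.e. E(P) > 0.5."""
    return efficiency(s_p, p) > 0.5


print(round(efficiency(15.43, 16), 2), is_effective(15.43, 16))  # 0.96 True
print(round(efficiency(2.31, 16), 2), is_effective(2.31, 16))    # 0.14 False
```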
4.1 Effect of Shared-Bus Width on Performance when Template Matrix=2x2
Tables 4.1, 4.2 and 4.3 and Figures 4.1, 4.2 and 4.3 show that convolution with a
template matrix of 2x2 is effectively executed by up to four processors if nl=32, by up to
four processors if nl=64, by up to eight processors if nl=128, by up to 16 processors if
nl=256, and by up to 32 processors if nl=512.
Figures 4.2 and 4.3 and Tables 4.2 and 4.3 show near perfect speedup factors and
system efficiencies (E>0.90) for nl=32&P=2, for nl=64&P=2,4, for nl=128&P=2,4, for
nl=256&P=2,4,8, and for nl=512&P=2,4,8,16.
Table 4.4 and Figure 4.4 show remarkable speedups of specific P-node systems,
which are due only to the increase in nl. In Appendix B, Table B.1 and Figures B.1 and B.2
show, in more detail, how the execution time decreases as the shared-bus-width
increases. Observing, for example, the first (0.6393 sec) and the last (0.0418 sec) entry of the
seventh column (corresponding to the 32-node system) in Table B.1, we measure a
shared-bus-speedup-factor of 15.6. Table 4.4 and Figure 4.4 show a maximum
shared-bus-speedup-factor of 1.7, 1.7, 2.9, 5.7, 11.2, 15.6 and 16 for 1-node, 2-node,
4-node, 8-node, 16-node, 32-node, and 64-node systems, respectively.
Table 4.1 Time vs. P for nl=32, 64, 128, 256 and 512 when template matrix=2x2
Figure 4.1 Time vs. P for nl=32, 64, 128, 256 and 512 when template matrix=2x2
Table 4.2 Speedup vs. P for nl=32, 64, 128, 256 and 512 when template matrix=2x2
Figure 4.2 Speedup vs. P for nl=32, 64, 128, 256 and 512 when template matrix=2x2
Table 4.3 Efficiency vs. P for nl=32, 64, 128, 256 and 512 when template matrix=2x2
Figure 4.3 Efficiency vs. P for nl=32, 64, 128, 256 and 512 when template matrix=2x2
Table 4.4 Speedup vs. Shared-bus-width (nl) for P=1, 2, 4, 8, 16, 32 and 64 when
template matrix=2x2
Figure 4.4 Speedup vs. Shared-bus-width (nl) for P=1, 2, 4, 8, 16, 32 and 64 when
template matrix=2x2
4.2 Effect of Shared-Bus Width on Performance when Template Matrix =3x3
Tables 4.5, 4.6 and 4.7 and Figures 4.5, 4.6 and 4.7 show that convolution with a
template matrix of 3x3 is effectively executed (E>0.50) by up to four processors if nl=32,
by up to eight processors if nl=64, by up to 16 processors if nl=128, by up to 32
processors if nl=256, and by up to 64 processors if nl=512.
Figures 4.6 and 4.7 and Tables 4.6 and 4.7 show near perfect speedup factors and
system efficiencies (E>0.90) for nl=32&P=2,4, for nl=64&P=2,4, for nl=128&P=2,4,8,
for nl=256&P=2,4,8,16, and for nl=512&P=2,4,8,16,32.
Table 4.8 and Figure 4.8 show remarkable speedups of specific P-node systems,
which are due only to the increase in nl. In Appendix B, Table B.3 and Figures B.4 and B.5
show, in more detail, how the execution time decreases as the shared-bus-width
increases. Observing, for example, the first (0.6495 sec) and the last (0.0457 sec) entry of the
last column (corresponding to the 64-node system) in Table B.3, we measure a
shared-bus-speedup-factor of 14.2. The value 14.2 can be confirmed by looking up the entry with
coordinates nl=512&P=64 in Table 4.8. Table 4.8 and Figure 4.8 show a maximum
shared-bus-speedup-factor of 1.3, 1.3, 1.4, 2.7, 5.3, 10.0 and 14.2 for 1-node, 2-node,
4-node, 8-node, 16-node, 32-node, and 64-node systems, respectively.
Table 4.5 Time vs. P for nl=32, 64, 128, 256 and 512 when template matrix=3x3
Figure 4.5 Time vs. P for nl=32, 64, 128, 256 and 512 when template matrix=3x3
Table 4.6 Speedup vs. P for nl=32, 64, 128, 256 and 512 when template matrix=3x3
Figure 4.6 Speedup vs. P for nl=32, 64, 128, 256 and 512 when template matrix=3x3
Table 4.7 Efficiency vs. P for nl=32, 64, 128, 256 and 512 when template matrix=3x3
Figure 4.7 Efficiency vs. P for nl=32, 64, 128, 256 and 512 when template matrix=3x3
Table 4.8 Speedup vs. Shared-bus-width (nl) for P=1, 2, 4, 8, 16, 32 and 64 when
template matrix=3x3
Figure 4.8 Speedup vs. Shared-bus-width (nl) for P=1, 2, 4, 8, 16, 32 and 64 when
template matrix=3x3
4.3 Effect of Shared-Bus Width on Performance when Template Matrix =6x6
Tables 4.9, 4.10 and 4.11 and Figures 4.9, 4.10 and 4.11 show that convolution with a
template matrix of 6x6 is effectively executed (E>0.50) by up to 16 processors if nl=32,
by up to 32 processors if nl=64, and by up to 64 processors if nl=128, 256, and 512.
Figures 4.10 and 4.11 and Tables 4.10 and 4.11 show near perfect speedup factors
and system efficiencies (E>0.90) for nl=32&P=2,4,8, for nl=64&P=2,4,8,16, for
nl=128&P=2,4,8,16,32, for nl=256&P=2,4,8,16,32, and for nl=512&P=2,4,8,16,32,64.
Table 4.12 and Figure 4.12 show remarkable speedups of specific P-node systems,
which are due only to the increase in nl. In Appendix B, Table B.5 and Figures B.7 and B.8
show, in more detail, how the execution time decreases as the shared-bus-width
increases. Observing, for example, the first (0.680 sec) and the last (0.129 sec) entry of the
last column (corresponding to the 64-node system) in Table B.5, we measure a
shared-bus-speedup-factor of 5.27. The value 5.27 can be confirmed by looking up the entry with
coordinates nl=512&P=64 in Table 4.12. Table 4.12 and Figure 4.12 show a maximum
shared-bus-speedup-factor of 1.1, 1.1, 1.1, 1.1, 1.5, 2.8 and 5.3 for 1-node, 2-node,
4-node, 8-node, 16-node, 32-node, and 64-node systems, respectively.
Table 4.9 Time vs. P for nl=32, 64, 128, 256 and 512 when template matrix=6x6
Figure 4.9 Time vs. P for nl=32, 64, 128, 256 and 512 when template matrix=6x6
Table 4.10 Speedup vs. P for nl=32, 64, 128, 256 and 512 when template matrix=6x6
Figure 4.10 Speedup vs. P for nl=32, 64, 128, 256 and 512 when template matrix=6x6
Table 4.11 Efficiency vs. P for nl=32, 64, 128, 256 and 512 when template matrix=6x6
Figure 4.11 Efficiency vs. P for nl=32, 64, 128, 256 and 512 when template matrix=6x6
Table 4.12 Speedup vs. Shared-bus-width (nl) for P=1, 2, 4, 8, 16, 32 and 64 when
template matrix=6x6
Figure 4.12 Speedup vs. Shared-bus-width (nl) for P=1, 2, 4, 8, 16, 32 and 64 when
template matrix=6x6
4.4 Effect of Shared-Bus Width on Performance when Template Matrix =9x9
Tables 4.13, 4.14 and 4.15 and Figures 4.13, 4.14 and 4.15 show that convolution with a
template matrix of 9x9 is effectively executed (E>0.50) by up to 32 processors if nl=32 and
by up to 64 processors if nl=64, 128, 256 and 512.
Figures 4.14 and 4.15 and Tables 4.14 and 4.15 show near perfect speedup factors
and system efficiencies (E>0.90) for nl=32&P=2,4,8,16, for nl=64&P=2,4,8,16,32, for
nl=128&P=2,4,8,16,32, for nl=256&P=2,4,8,16,32,64, and for nl=512&P=2,4,8,16,32,64.
Table 4.16 and Figure 4.16 show speedups of specific P-node systems, which are
due only to the increase in nl. In Appendix B, Table B.7 and Figures B.10 and B.11 show, in
more detail, how the execution time decreases as the shared-bus-width increases.
Observing, for example, the first (0.710 sec) and the last (0.277 sec) entry of the last
column (corresponding to the 64-node system) in Table B.7, we measure a
shared-bus-speedup-factor of 2.56. The value 2.56 can be confirmed by looking up the entry with
coordinates nl=512&P=64 in Table 4.16. Table 4.16 and Figure 4.16 show a maximum
shared-bus-speedup-factor of 1.04, 1.04, 1.04, 1.04, 1.07, 1.43 and 2.56 for 1-node, 2-node,
4-node, 8-node, 16-node, 32-node, and 64-node systems, respectively.
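The shared-bus-speedup-factors quoted for the 64-node system can be recomputed from the first and last timings cited in the text for the 3x3, 6x6, and 9x9 templates. The Python sketch below assumes the ratio form T(32)/T(nl) and uses only the timings quoted in the prose; the function name is hypothetical.

```python
def bus_width_speedup(t_32, t_nl):
    """Shared-bus-speedup-factor S(nl) = T(32) / T(nl)."""
    return t_32 / t_nl


# (T(32), T(512)) in seconds for the 64-node system, as quoted in the text.
quoted_timings = {
    "3x3": (0.6495, 0.0457),
    "6x6": (0.680, 0.129),
    "9x9": (0.710, 0.277),
}

for template, (t_32, t_512) in quoted_timings.items():
    print(template, round(bus_width_speedup(t_32, t_512), 2))
# 3x3 14.21
# 6x6 5.27
# 9x9 2.56
```

The decreasing factors show how the benefit of a wider shared-bus shrinks as the template matrix grows.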
The shared-bus-speedup-factor when template matrix=9x9 is not significant.
The reason is that applications with very large amounts of spatial and temporal
locality (like the one we examine) demand much less shared-bus bandwidth than
applications with less spatial and temporal locality. If the supply of shared-bus
bandwidth is already sufficient when nl=32, we do not expect a significant improvement in
performance when a wider shared-bus is utilized. How spatial and temporal locality
affect the performance of the twin-prefetching multiprocessor is examined thoroughly in
Chapter 5.
Table 4.13 Time vs. P for nl=32, 64, 128, 256 and 512 when template matrix=9x9
Figure 4.13 Time vs. P for nl=32, 64, 128, 256 and 512 when template matrix=9x9
Table 4.14 Speedup vs. P for nl=32, 64, 128, 256 and 512 when template matrix=9x9
Figure 4.14 Speedup vs. P for nl=32, 64, 128, 256 and 512 when template matrix=9x9
Table 4.15 Efficiency vs. P for nl=32, 64, 128, 256 and 512 when template matrix=9x9
Figure 4.15 Efficiency vs. P for nl=32, 64, 128, 256 and 512 when template matrix=9x9
Table 4.16 Speedup vs. Shared-bus-width (nl) for P=1, 2, 4, 8, 16, 32 and 64 when
template matrix=9x9
Figure 4.16 Speedup vs. Shared-bus-width (nl) for P=1, 2, 4, 8, 16, 32 and 64 when
template matrix=9x9
4.5 Discussion of Results
The elimination of the traditional direct linkage of the shared-bus and the processor data
bus makes the utilization of a wider shared-bus feasible. The additional bandwidth provided
by a wider shared-bus, along with the twin-prefetching mechanism, makes possible the effective
support (E>0.50) of 32 processors and the near perfect support (E>0.90) of 16
processors. These numbers of effectively supported processors are based on the worst-case
scenario, in which an application with a small amount of spatial and temporal
locality (template matrix=2x2) is executed. In the best-case scenario, when an application
with very high spatial and temporal locality (template matrix=9x9) is executed,
even a 64-node twin-prefetching multiprocessor achieves a system efficiency greater than
0.9 (E>0.90).
Table 4.17 shows how many processors can be effectively supported (E>0.50)
for every value of nl. For example, the entry in Table 4.17 with coordinates
nl=512&template matrix=3x3 reads 2,4,8,16,32,64. This means that the twin-prefetching
system utilizing a shared-bus with 512 data lines could effectively support (E>0.50) up to
64 processors.
Table 4.18 shows the P-node twin-prefetching multiprocessors achieving nearly
perfect system efficiency (E>0.90). Table 4.18 indicates that a very large number of P-node
systems achieve such a high system efficiency.
Tables 4.17 and 4.18 indicate which systems should or should not be
implemented. Let us take as an example the first entry (2, 4, 8, 16, 32) of the column
labeled nl=512 in Table 4.17. Since the number 64 is not included in the entry, a 64-node
twin-prefetching system should not be built (because E<0.5) if an application with
characteristics (Twratio) similar to the one indicated (2x2) is to be run on the system.
Depending on the system efficiency required, we select the appropriate table as a guide for
the design. It is important to note that if a specific P-node twin-prefetching system is
effectively supported by more than one value of nl, we should choose the smallest value. As
an example, let us take the values of the first row of Table 4.18. We observe that a
shared-bus-width of 64, 128, 256, or 512 supports a 4-node system well (E>0.90). The logical
choice among these shared-bus-widths is 64.
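The design rule of choosing the smallest adequate shared-bus-width can be expressed as a simple lookup. The Python sketch below is a minimal illustration: the dictionary holds the P values that Section 4.1 reports as reaching near perfect efficiency for the 2x2 template, and the function name `smallest_bus_width` is hypothetical.

```python
# P values with near perfect efficiency for the 2x2 template, per Section 4.1,
# keyed by shared-bus-width nl.
NEAR_PERFECT_2X2 = {
    32: {2},
    64: {2, 4},
    128: {2, 4},
    256: {2, 4, 8},
    512: {2, 4, 8, 16},
}


def smallest_bus_width(supported, p):
    """Return the smallest nl whose entry contains P, or None if none does."""
    widths = [nl for nl, nodes in supported.items() if p in nodes]
    return min(widths) if widths else None


print(smallest_bus_width(NEAR_PERFECT_2X2, 4))   # 64
print(smallest_bus_width(NEAR_PERFECT_2X2, 16))  # 512
print(smallest_bus_width(NEAR_PERFECT_2X2, 64))  # None
```

A result of None corresponds to a system that should not be built for an application with these characteristics.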
Finally, the effect of utilizing a wider shared-bus for a twin-prefetching
multiprocessor is clearly demonstrated in both Tables 4.17 and 4.18, i.e., a wider
shared-bus significantly increases the number of effectively supported processing elements.
Table 4.17 P-node twin-prefetching multiprocessors with system efficiency E>0.50
Table 4.18 P-node twin-prefetching multiprocessors with system efficiency E>0.90
CHAPTER 5
EFFECT OF DATA LOCALITY ON PERFORMANCE
This chapter investigates the effect of spatial and temporal locality of data on the
performance of the proposed twin-prefetching shared-memory shared-bus multiprocessor
system. The performance of any computer system is greatly affected by the amount of
data locality present in an application. Traditional caches were invented to take advantage
of data locality and increase performance. (Spatial locality is the tendency of a program to
access items whose addresses are near one another, i.e., if a location in
memory is accessed, then others nearby will probably be accessed soon. Temporal
locality denotes that if an address in memory is accessed once, then it will probably be
accessed again soon.) The proposed system employs software-controlled prefetching in
caches, a technique that further increases performance according to recent research.
In the case of the convolution algorithm, the number of local neighborhood
operations is a direct measure of the amount of data locality. The following discussion
intends to show the correlation between data locality, the size of the template matrix, and
the number of memory references to every pixel value in memory. The number of local
neighborhood operations is equal to the number of coefficients in the template matrix by
definition. In the case of the proposed system, the number of local neighborhood operations
is also equal to the number of memory references required to produce an output pixel. Due
to the dual data fetch capability of the ADSP-21060 (other modern DSPs have
similar capabilities), a coefficient and an input pixel value are fetched simultaneously
from memory; therefore, for a template matrix consisting of k coefficients, k dual fetches
(one from internal memory and one from external memory) take place to produce an
output pixel. We conclude that in the case of the convolution algorithm every datum
(except those at the edges) of an image segment transferred to the prefetching cache is
referenced k times (temporally or spatially). From this point on, locality k of an application
will denote that the average number of times every image pixel is referenced is k.
Our goal of investigating the proposed system under different amounts of data
locality is achieved by increasing or decreasing the size of the template matrix.
Specifically, the template matrix size is varied from 2x2 to 3x3 to 6x6 to 9x9; therefore, the
number of memory references to every image pixel value in data memory is 4, 9, 36 and
81, respectively. Thus, a wide range of data locality cases is covered.
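The claim that every pixel is referenced k times can be checked by counting reads directly. The Python sketch below is an illustration under one simplifying assumption: it counts references over a whole image with wraparound, so that even edge pixels are read exactly k times. In a prefetched segment without wraparound, pixels at the segment edges would be referenced fewer times. The function name `reference_counts` is hypothetical.

```python
def reference_counts(rows, cols, m_r, n_c):
    """Count how often each input pixel is read while computing every output
    pixel of a wraparound convolution with an m_r x n_c template."""
    counts = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            for k in range(m_r):
                for l in range(n_c):
                    counts[(i + k) % rows][(j + l) % cols] += 1
    return counts


for size in (2, 3, 6, 9):
    counts = reference_counts(16, 16, size, size)
    distinct = {c for row in counts for c in row}
    print(size, distinct)  # a single value, size*size, for every pixel
```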
There are five tables labeled "Execution Time vs. Number of Processors," one
table for every value that the shared-bus-width receives (nl=32, 64, 128, 256, 512). Each
table consists of five columns. The first column contains P values, while the other four
columns contain timings of the P-node system when the template matrix used has size
2x2, 3x3, 6x6 and 9x9. The corresponding speedup, S(P), and efficiency, E(P), of every
entry in the tables labeled "Execution Time vs. Number of Processors" are found in tables
labeled "Speedup vs. Number of Processors" and "Efficiency vs. Number of Processors."
Notation in this chapter for tables or figures is as follows: Tmpl.mtrx or mtrx stands for
template matrix; S(zxz) and E(zxz) stand for speedup and efficiency, respectively, when a
template matrix of size zxz is used; P stands for number of processors; nl stands for
shared-bus-width; and 1w/o stands for one processor without the twin-prefetching caches.
An effective system is considered to be a system with a speedup factor greater
than P/2, i.e., providing system efficiency greater than 0.5. Shaded gray areas within the
tables point out effective systems.
5.1 Effect of Spatial and Temporal Locality on a P-Node Twin-Prefetching
Multiprocessor when Shared-Bus-Width is 32
Tables 5.1, 5.2, 5.3, and Figures 5.2, 5.3, 5.4 show how the proposed twin-prefetching
system performs as the size of the template matrix (amount of data locality) varies when the
shared-bus-width is 32. Adjacent row entries in Table 5.1 report two timings that
indicate how fast two system configurations, employing different numbers of
processors, execute the same application. A useful piece of information is obtained by
observing the location in each column where the values of adjacent rows do not differ
significantly. The P coordinate at this location denotes the maximum number of
processors that can effectively (with system efficiency greater than 50%) execute the
application. This location is the last gray-shaded row in a
column. From the left column to the right, the shaded areas widen, denoting a sharp increase
in the number of processors that can effectively execute the application. When a 2x2 template
matrix (locality 4) is used, only 4 processors effectively execute the application, while if a
9x9 template matrix (locality 81) is used, the number of processors increases to 32. The
fuel that drives the additional 28 processors is data locality. Because of greater data
locality, a larger number of references to every image pixel in the prefetching cache
keeps processors occupied with the same image segment for a longer period of time, thus
allowing a greater number of processors to be serviced by the shared-bus.
Tables 5.2 and 5.3, and Figures 5.3 and 5.4 show the speedup and efficiency of the
system for different sizes of template matrix and demonstrate the significant change in
performance depending on data locality. Speedup and efficiency are better indicators of
system performance than raw timings because they are easier to interpret: a good speedup
value approaches P, while a good system efficiency approaches 1. When, for example,
P=16 in Table 5.2, the speedup is 2.31 for template matrix=2x2 while the speedup is
15.43 for template matrix=9x9. Likewise, when P=16 in Table 5.3, system efficiency is
0.14 for template matrix=2x2 while efficiency is 0.96 for template matrix=9x9. The
significant improvement in system efficiency is solely due to a greater amount of data
locality, since all other software and hardware parameters are kept constant.
5.1.1 Performance of One DSP Processor Without Twin-Prefetching Cache
Memories
Twin-prefetching caches were removed from the one-node system (P=1) and all four
applications were executed assuming all image data (eight images of size 1024x1024) are
stored in main memory. Figure 5.1 compares the execution times of the one-node
system with and without the prefetching caches. All applications executed
faster on the system employing twin-prefetching. Results show a speedup of 2.88, 2.75,
2.23, and 1.70 when the template matrix used is 9x9, 6x6, 3x3, and 2x2, respectively.
Therefore, applications with a greater amount of data locality achieve better speedup.
It is also demonstrated that twin-prefetching can benefit uniprocessor systems.
P-node systems (P>1) without the twin-prefetching caches were not simulated
because a shared-bus would not be able to support more than one ADSP-21060. It should
be noted that the ADSP-21060 does not carry an on-chip cache.
Table 5.1 Execution Time vs. Number of Processors for template matrix=2x2, 3x3, 6x6,
9x9 when nl=32
Figure 5.1 Execution time of applications on a uniprocessor system with and without
twin prefetching (nl=32)
Figure 5.2 Execution Time vs. Number of Processors for template matrix=2x2, 3x3,
6x6, 9x9 when nl=32
Table 5.2 Speedup vs. Number of Processors for template matrix=2x2, 3x3, 6x6, 9x9
when nl=32
Figure 5.3 Speedup vs. Number of Processors for template matrix=2x2, 3x3, 6x6, 9x9
when nl=32
Table 5.3 Efficiency vs. Number of Processors for template matrix=2x2, 3x3, 6x6, 9x9
when nl=32
Figure 5.4 Efficiency vs. Number of Processors for template matrix=2x2, 3x3, 6x6, 9x9
when nl=32
5.2 Effect of Spatial and Temporal Locality on a P-Node Twin-Prefetching
Multiprocessor when Shared-Bus-Width is 64
Tables 5.4, 5.5, 5.6, and Figures 5.5, 5.6, 5.7 show how the proposed twin-prefetching
system performs as the size of the template matrix (amount of data locality) varies when the
shared-bus-width is 64. From the left column to the right, the shaded areas widen, denoting a
sharp increase in the number of processors that can effectively execute the application.
When a 2x2 template matrix (locality 4) is used, only 4 processors effectively execute the
application, while if a 9x9 template matrix (locality 81) is used, the number of processors
increases to 64. The fuel that drives the additional 60 processors is data locality. Because
of greater data locality, a larger number of references to every image pixel in the
prefetching cache keeps processors occupied with the same image segment for a longer
period of time, thus allowing a greater number of processors to be serviced by the
shared-bus.
Tables 5.5 and 5.6, and Figures 5.6 and 5.7 show the speedup and efficiency of
the system for different sizes of template matrix and demonstrate the significant change
of performance depending on data locality. When, for example, P=32 in Table 5.5
speedup is 3.62 for template matrix=2x2 while speedup is 29.68 for template matrix=9x9.
Likewise, when P=32 in Table 5.6 system efficiency is 0.11 for template matrix=2x2
while efficiency is 0.92 for template matrix=9x9. The significant improvement in system
efficiency is solely due to the greater amount of data locality since all other software and
hardware parameters are kept constant.
Table 5.4 Execution Time vs. Number of Processors for template matrix=2x2, 3x3, 6x6,
9x9 when nl=64
Figure 5.5 Execution Time vs. Number of Processors for template matrix=2x2, 3x3,
6x6, 9x9 when nl=64
Table 5.5 Speedup vs. Number of Processors for template matrix=2x2, 3x3, 6x6, 9x9
when nl=64
Figure 5.6 Speedup vs. Number of Processors for template matrix=2x2, 3x3, 6x6, 9x9
when nl=64
Table 5.6 Efficiency vs. Number of Processors for template matrix=2x2, 3x3, 6x6, 9x9
when nl=64
Figure 5.7 Efficiency vs. Number of Processors for template matrix=2x2, 3x3, 6x6, 9x9
when nl=64
5.3 Effect of Spatial and Temporal Locality on a P-Node Twin-Prefetching
Multiprocessor when Shared-Bus-Width is 128
Tables 5.7, 5.8 and 5.9, and figures 5.8, 5.9 and 5.10 show how the proposed twin-
prefetching system performs as the size of template matrix (amount of data locality)
varies, when shared-bus-width is 128. From left to right column shaded areas widen,
denoting a sharp increase of the number of processors that can effectively execute the
application. When a 2x2 template matrix (locality 4) is used 8 processors effectively
execute the application while if a 9x9 template matrix (locality 81) is used the number of
processors increases to 64. The fuel that drives the additional 56 processors is data
locality. Because of greater data locality a larger number of references, for every image
pixel in the prefetching cache, keeps processors occupied with the same image segment
for a longer period of time, thus allowing a greater number of processors to be serviced
by the shared-bus.
Tables 5.8 and 5.9, and figures 5.9 and 5.10 show the speedup and efficiency of
the system for different sizes of template matrix and demonstrate the significant change
of performance depending on data locality. When, for example, P=32 in Table 5.8,
speedup is 6.25 for template matrix=2x2 while speedup is 30.78 for template matrix=9x9.
Likewise, when P=32 in Table 5.9 system efficiency is 0.19 for template matrix=2x2
while efficiency is 0.96 for template matrix=9x9. The significant improvement in system
efficiency is solely due to the greater amount of data locality since all other software and
hardware parameters are kept constant.
Table 5.7 Execution Time vs. Number of Processors for template matrix=2x2, 3x3, 6x6,
9x9 when nl=128
Figure 5.8 Execution Time vs. Number of Processors for template matrix=2x2, 3x3, 6x6,
9x9 when nl=128
Table 5.8 Speedup vs. Number of Processors for template matrix=2x2, 3x3, 6x6, 9x9
when nl=128
Figure 5.9 Speedup vs. Number of Processors for template matrix=2x2, 3x3, 6x6, 9x9
when nl=128
Table 5.9 Efficiency vs. Number of Processors for template matrix=2x2, 3x3, 6x6, 9x9
when nl=128
Figure 5.10 Efficiency vs. Number of Processors for template matrix=2x2, 3x3, 6x6,
9x9 when nl=128
5.4 Effect of Spatial and Temporal Locality on a P-Node Twin-Prefetching
Multiprocessor when Shared-Bus-Width is 256
Tables 5.10, 5.11 and 5.12, and figures 5.11, 5.12 and 5.13 show how the proposed twin-
prefetching system performs as the size of template matrix (amount of data locality)
varies, when shared-bus-width is 256. From left to right column shaded areas widen,
denoting a sharp increase of the number of processors that can effectively execute the
application. When a 2x2 template matrix (locality 4) is used, 16 processors effectively
execute the application while if a 9x9 template matrix (locality 81) is used the number of
processors increases to 64. The fuel that drives the additional 48 processors is data
locality. Because of greater data locality, a larger number of references, for every image
pixel in the prefetching cache, keeps processors occupied with the same image segment
for a longer period of time; thus allowing a greater number of processors to be serviced
by the shared-bus.
Tables 5.11 and 5.12, and Figures 5.12 and 5.13 show the speedup and efficiency
of the system for different sizes of template matrix and demonstrate the significant
change of performance depending on data locality. When, for example, P=32 in Table
5.11, the speedup reached is 11.49 for template matrix=2x2 while speedup is 31.38 for
template matrix=9x9. Likewise, when P=32 in Table 5.12 system efficiency is 0.36 for
template matrix=2x2 while efficiency is 0.98 for template matrix=9x9. The significant
improvement in system efficiency is solely due to greater amount of data locality since all
other software and hardware parameters are kept constant.
As observed in Sections 5.1, 5.2, 5.3, and 5.4, as shared-bus-width widens, more
bandwidth is available to serve a greater number of processors. Therefore, both shared-
bus-width and data locality contribute to a better system efficiency. Thus, column shaded
areas in tables are larger as sections advance.
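The speedup and efficiency figures quoted for P=32 in Sections 5.2 to 5.4 can be cross-checked with a few lines of Python. This is a minimal sketch that assumes the standard definition efficiency = speedup / P; the function name and the dictionary layout are illustrative and not part of the simulator. Small differences in the last digit are expected, because the quoted speedups are themselves rounded.

```python
# Speedup and efficiency values quoted in the text for P = 32.
P = 32
QUOTED = {
    # shared-bus-width nl: {template matrix: (speedup, efficiency)}
    64: {"2x2": (3.62, 0.11), "9x9": (29.68, 0.92)},
    128: {"2x2": (6.25, 0.19), "9x9": (30.78, 0.96)},
    256: {"2x2": (11.49, 0.36), "9x9": (31.38, 0.98)},
}


def efficiency(speedup, processors):
    """Assumed standard definition: efficiency = speedup / processors."""
    return speedup / processors


for nl, rows in QUOTED.items():
    for template, (speedup, quoted_eff) in rows.items():
        computed = efficiency(speedup, P)
        print(f"nl={nl:3d} template={template}: "
              f"computed {computed:.3f}, quoted {quoted_eff:.2f}")
```

For every bus width, the 9x9 template (locality 81) stays above 0.9 while the 2x2 template (locality 4) stays below 0.4, which is the locality effect described above.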
Table 5.10 Execution Time vs. Number of Processors for template matrix=2x2, 3x3,
6x6, 9x9 when nl=256
Figure 5.11 Execution Time vs. Number of Processors for template matrix=2x2, 3x3,
6x6, 9x9 when nl=256
Table 5.11 Speedup vs. Number of Processors for template matrix=2x2, 3x3, 6x6, 9x9
when nl=256
Figure 5.12 Speedup vs. Number of Processors for template matrix=2x2, 3x3, 6x6, 9x9
when nl=256
Table 5.12 Efficiency vs. Number of Processors for template matrix=2x2, 3x3, 6x6, 9x9
when nl=256
Figure 5.13 Efficiency vs. Number of Processors for template matrix=2x2, 3x3, 6x6,
9x9 when nl=256
5.5 Effect of Spatial and Temporal Locality on a P-Node Twin-Prefetching
Multiprocessor when Shared-Bus-Width is 512
Tables 5.13, 5.14 and 5.15, and figures 5.14, 5.15 and 5.16 show how the proposed twin-
prefetching system performs as the size of template matrix (amount of data locality)
varies, when shared-bus-width is 512. The shared-bus is wide enough to supply the
necessary amount of bandwidth and effectively support 64 processors when template
matrices 3x3, 6x6, and 9x9 are used. Therefore, there is little room to observe system
performance improvement due to data locality. The comparison of the last entries of the first
two columns in Tables 5.13, 5.14 and 5.15 is the only possible demonstration of system
performance, i.e., when a 2x2 template matrix (locality 4) is used, 32 processors
effectively execute the application while if a 3x3 template matrix (locality 9) is used the
number of processors increases to 64. The significant improvement of system efficiency
is solely due to the greater amount of data locality in the latter application.
Table 5.13 Execution Time vs. Number of Processors for template matrix=2x2, 3x3,
6x6, 9x9 when nl=512
Figure 5.14 Execution Time vs. Number of Processors for template matrix=2x2, 3x3,
6x6, 9x9 when nl=512
Table 5.14 Speedup vs. Number of Processors for template matrix=2x2, 3x3, 6x6, 9x9
when nl=512
Figure 5.15 Speedup vs. Number of Processors for template matrix=2x2, 3x3, 6x6, 9x9
when nl=512
Table 5.15 Efficiency vs. Number of Processors for template matrix=2x2, 3x3, 6x6, 9x9
when nl=512




In the convolution algorithm, if α output matrix rows are to be produced, α+mr-1 rows
have to be present (assuming partitioning by dividing the total number of rows by P). Every
processor in a message-passing multicomputer system would have to transfer
(communicate) the mr-1 rows from other nodes. In a multiprocessor system every node
retrieves from shared-memory an extra mr-1 rows in order to produce the α output rows.
Thus, a P-node parallel computer system would have to communicate at least P(mr-1)
extra rows of data to convolve an image. The ratio (mr-1)/α is a direct measure of the
communication overhead and its value is expected to be small (much smaller than 1). We
allow the ratio (mr-1)/α to take values much larger than one in order to test the performance
of the proposed architecture with applications employing a large amount of communication
overhead. Table 5.16 consists of execution times when α=1; thus, the ratio (mr-1)/α
becomes 1/1, 2/1, 5/1, and 8/1. Table 5.17 consists of execution times when α=2; thus,
the ratio (mr-1)/α becomes 1/2, 2/2, 5/2, and 8/2. The difference between corresponding
timing entries in Tables 5.16 and 5.17 is small. The difference between the
entries in Tables 5.16 and 5.17 and the corresponding entries in Table 5.13, for which
α=32, is also insignificant. This strongly indicates that the proposed system could handle
applications with a large amount of communication.
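The overhead ratios listed above follow directly from the template heights used in this work (2, 3, 6 and 9 rows). The sketch below reproduces them; the function names are hypothetical, and `Fraction` is used only so that the ratios print exactly (it reduces 2/2 to 1 and 8/2 to 4).

```python
from fractions import Fraction


def overhead_ratio(template_rows, alpha):
    """Ratio (mr - 1) / alpha of extra input rows to output rows per node."""
    return Fraction(template_rows - 1, alpha)


def min_extra_rows(template_rows, nodes):
    """Lower bound P * (mr - 1) on the extra rows moved by a P-node system."""
    return nodes * (template_rows - 1)


TEMPLATE_ROWS = (2, 3, 6, 9)  # mr for the 2x2, 3x3, 6x6 and 9x9 templates

for alpha in (1, 2, 32):
    ratios = [str(overhead_ratio(mr, alpha)) for mr in TEMPLATE_ROWS]
    print(f"alpha={alpha:2d}: (mr-1)/alpha = {', '.join(ratios)}")
```

With α=32 every ratio is well below one, which is the regime the text calls expected; α=1 and α=2 are the deliberately communication-heavy cases of Tables 5.16 and 5.17.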
Table 5.16 Execution Time vs. Number of Processors when α=1 (nl=512)
Table 5.17 Execution Time vs. Number of Processors when α=2 (nl=512)
CHAPTER 6
DETERMINATION OF THE OPTIMAL SIZE OF PREFETCHING CACHES
In this chapter we investigate the effect of various prefetching cache sizes on the
performance of all configurations of the twin-prefetching shared-memory multiprocessor
system. The prefetching cache was varied (ten different values) from a few Kbytes to 4
Mbytes to determine the cost-performance optimal size. The twin-prefetching cache's
function is to temporarily hold blocks of input and output data. Its size is an important
design parameter, affecting performance, cost, and volume of the multiprocessor system.
Four different applications (Convolution utilizing four different template
matrices) were executed on the 350 configurations of hardware parameters (varying
shared-bus-width, number of processors and prefetching cache size). From the 1400 runs
(Appendix C), a careful collection of results, shown in Tables 6.4, 6.5, 6.6, 6.7, 6.8, 6.9,
6.10, 6.11, 6.12, 6.13, and 6.14, enables the optimal prefetching cache size selection for
every P-node based system.
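The run counts quoted in this chapter can be reproduced by enumerating the parameter space. The sketch below assumes the seven processor counts and five bus widths used throughout this work; the ten prefetching cache sizes are represented only by an index, because their exact values are not listed here.

```python
from itertools import product

PROCESSORS = (1, 2, 4, 8, 16, 32, 64)     # seven values of P
BUS_WIDTHS = (32, 64, 128, 256, 512)      # five values of nl
TEMPLATES = ("2x2", "3x3", "6x6", "9x9")  # four applications
CACHE_SIZE_COUNT = 10                     # ten cache sizes (values not listed here)

# One hardware configuration = (P, nl, index of the cache size).
hardware = list(product(PROCESSORS, BUS_WIDTHS, range(CACHE_SIZE_COUNT)))
# One run = one application executed on one hardware configuration.
runs = list(product(hardware, TEMPLATES))
# One set of timings = all cache sizes for a fixed (P, nl, application).
timing_sets = list(product(PROCESSORS, BUS_WIDTHS, TEMPLATES))

print(len(hardware), len(runs), len(timing_sets))
```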
An important observation, which assists the analysis of results and involves the
availability of shared-bus bandwidth and categorization of applications, is described in
Section 6.1. The reasoning behind the selection of results is explained in Section 6.2.
Results are analyzed in Section 6.3 and presented in Sections 6.4-6.10. A discussion of
results is given in Section 6.11.
The notation used to label variables in the tables is as follows: P stands for the
number of processors, nl stands for the shared-bus-width, tm. wn stands for the size of the
template window, cszvr stands for the prefetching cache size, csz stands for the size of the
smallest prefetching cache for which the best execution time occurs, L95% stands for the
smallest cache size that gives at least 95% of the performance of csz, and Time stands for
the execution time of the application.
Prefetching cache size is measured in Kbytes while execution time is measured
in seconds. All execution times of applications are based on the Convolution of eight
consecutive images having resolution 1024x1024.
6.1 Shared Bus Contention Cases
Chapters 2 and 3 establish that all applications executed on the twin-prefetching shared
memory multiprocessor fall into two cases of shared-bus bandwidth, depending on
Twratio. When Twratio is greater than or equal to P there is enough bus bandwidth for all
processing elements, while a bottleneck appears for smaller values of Twratio. More
precisely, Case I and Case II are stated as
I. Twratio ≥ P adequate shared-bus bandwidth for all processors
II. Twratio < P bus contention - bottleneck
For a specific P-node based system and a specific application we test ten different
values for the prefetching cache size. For seven different values of P, five different values
of nl, and four applications, 140 sets of timings were recorded (Appendix C). After a
detailed study of all sets of timings, it was observed that some sets of timings included a
minimum value of time and some sets didn't. An example of no minimum occurrence is
shown in Table 6.1. Two examples of minimum occurrence are shown in Table 6.2 and
Table 6.3.
Table 6.1 Prefetching cache size vs. time when P=8, nl=32, and template window=2x2
Table 6.2 Prefetching cache size vs. time when P=8, nl=128, and template window=2x2
Table 6.3 Prefetching cache size vs. time when P=8, nl=512, and template window=2x2
Furthermore, it was observed that a minimum occurs when Twratio is greater than or equal to
P/2 and no minimum value occurs for Twratio less than P/2. In other words Case II,
above, divides into two cases, and thus
I. Twratio < P/2 (great need of bus bandwidth - bottleneck)
II. P/2 ≤ Twratio < P (need of bus bandwidth - bottleneck)
III. Twratio ≥ P (bus bandwidth availability)
The importance of the three cases above is that they point out the amount of shared-bus
contention. Thus, by looking at a table entry (Table 6.4 through Table 6.14) we observe
not only the prefetching cache size needed for the particular hardware configuration but
also how good the performance of the system is, i.e., the less the bus contention the better
the execution time and the better the performance. Timings falling within Case III
(shared-bus bandwidth availability for all processors) are accompanied by double quotes
(") on the right side of their value. Timings falling within Case II (shared-bus bandwidth
in need) are accompanied by single quotes (') on the right side of their value. Lastly,
timings falling within Case I (shared-bus bandwidth in great need) are accompanied by
no quotes on the right side of their value.
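The classification into the three bus contention cases, and the quote marks attached to each timing, can be summarized as a small function. This is only a sketch: Twratio is treated as a plain number supplied by the caller (its definition is given in Chapters 2 and 3), the middle case is read as P/2 ≤ Twratio < P, and the function names are hypothetical.

```python
def bus_case(twratio, processors):
    """Return 'I', 'II' or 'III' for a given Twratio and number of processors P."""
    if twratio < processors / 2:
        return "I"    # great need of bus bandwidth, no minimum timing observed
    if twratio < processors:
        return "II"   # need of bus bandwidth, a minimum timing occurs
    return "III"      # enough bus bandwidth for all processors


def quote_mark(case):
    """Mark placed to the right of a timing in Tables 6.4 through 6.14."""
    return {"I": "", "II": "'", "III": '"'}[case]


# Hypothetical Twratio values for an 8-processor system.
for twratio in (3.0, 4.0, 7.9, 8.0):
    case = bus_case(twratio, 8)
    print(f"Twratio={twratio}: Case {case}, mark={quote_mark(case)!r}")
```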
It should be noted that it is unknown whether larger values of cszvr would have
produced a minimum timing in Case I. No higher values were tried because they are of
no interest to our work for two reasons. One is because results show that a much smaller
prefetching cache size provides performance very close to the maximum performance.
The other reason is that very large sizes of prefetching cache would be inappropriate for
our low cost system.
6.2 Selection of Results
We obtained 140 sets of ten timings each (Appendix C) running four applications on all
variations (seven different values of P, five different values of nl, ten different values of
prefetching cache) of the twin-prefetching P-node multiprocessor system. If a set provides
a minimum timing value (best performance), the corresponding prefetching cache size is
recorded in Tables 6.4 through 6.14 under the label csz (Case II and Case III). Under the
label L95% we record the smallest prefetching cache size in a set, which provides at least
95% of the performance of csz (example results given in Table 6.2 and Table 6.3). If no
minimum time occurs (Case I), then no prefetching cache size value is recorded under the
label csz. Since in this case the best time occurs when the largest prefetching cache size is
tried (4 Mbytes) we record as the L95% value the smallest prefetching cache size that
gives at least 95% of the performance achieved at 4 Mbytes (example shown in Table
6.1). Observing timings in Table 6.1 and results in Appendix C, we conclude that the
error of no minimum occurrence is virtually zero. This is due to the fact that timings
which correspond to the larger prefetching cache sizes are very close to the timings which
correspond to much smaller prefetching cache sizes.
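The selection rule for csz and L95% described above can be sketched as follows. Two points are assumptions rather than statements from the text: "performance" is taken to be the reciprocal of execution time, and a set is treated as having no minimum when its best time is reached only at the largest cache size tried. The timing values are invented for illustration and are not taken from Appendix C.

```python
def select_sizes(timings, threshold=0.95):
    """Return (csz, l95) for a list of (cache_kbytes, seconds) pairs.

    csz is None when the best time is reached only at the largest size tried.
    """
    timings = sorted(timings)
    best_time = min(seconds for _, seconds in timings)
    best_size = min(size for size, seconds in timings if seconds == best_time)
    largest_size = timings[-1][0]
    csz = best_size if best_size < largest_size else None
    # At least 95% of the best performance means time <= best_time / 0.95.
    limit = best_time / threshold
    l95 = min(size for size, seconds in timings if seconds <= limit)
    return csz, l95


# Hypothetical timing sets (cache size in Kbytes, time in seconds).
with_minimum = [(16, 2.00), (32, 1.60), (64, 1.52), (128, 1.50), (256, 1.48),
                (512, 1.47), (1024, 1.46), (2048, 1.45), (4096, 1.45)]
no_minimum = [(16, 2.00), (32, 1.70), (64, 1.55), (128, 1.50), (256, 1.49),
              (512, 1.48), (1024, 1.47), (2048, 1.46), (4096, 1.45)]

print(select_sizes(with_minimum))
print(select_sizes(no_minimum))
```

In both invented sets the L95% size is far smaller than the size that gives the best time, which mirrors the gap between csz and L95% reported in this chapter.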
Since our goal is to design a low cost DSP-based image-processing
multiprocessor system, the smaller the size of the prefetching cache the better. Results
point to a large gap between csz value and its corresponding L95% value. The difference
in size between csz and L95% increases, as P becomes smaller. It should be noted, for the
better understanding of results and more accurate evaluation of the system, that a
difference of a few tens of Kbytes in the L95% prefetching cache size, when keeping nl
constant and varying the template window, does not imply a bus bandwidth related
prefetching cache size increase or decrease, but one that comes directly from the
algorithm. More precisely, in order to produce k output rows or columns one has to load
into the prefetching cache k+c-1 rows or columns, where c is the number of rows or
columns of the template window. Thus, for the same number of output rows we need to
load different number of input data rows depending on the size of the template window.
This variation is small and negligible reaching a maximum value of 29 Kbytes. The
number 29 Kbytes could be calculated as follows. The largest template window used is
9x9 and the smallest 2x2. The respective extra rows loaded are eight and one. Their
difference is seven. Every row of the input image is 1024 pixels. Since pixels are stored as
32-bit elements (32-bit DSP, four bytes per pixel) in the prefetching cache, the maximum
of seven extra rows would take 4x1024x7 bytes = 28 Kbytes of memory space. There is one
Kbyte of difference between the observed and calculated maximum variation. The
additional one is the round-up Kbyte from the wrap-around operation of column pixels.
Therefore, 28+1=29. This difference could be seen in some column results of Tables 6.8,
6.9, 6.10, and 6.11 where the minimum value is 12 Kbytes and the maximum 41 Kbytes.
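The 29 Kbyte figure can be reproduced with the arithmetic described above. The constant and function names in this sketch are illustrative; the numbers are the ones given in the text.

```python
BYTES_PER_PIXEL = 4   # pixels are stored as 32-bit elements
ROW_PIXELS = 1024     # every input image row holds 1024 pixels


def rows_to_load(output_rows, template_rows):
    """Input rows needed to produce k output rows: k + c - 1."""
    return output_rows + template_rows - 1


def extra_rows(template_rows):
    """Rows loaded in addition to the output rows: c - 1."""
    return template_rows - 1


row_difference = extra_rows(9) - extra_rows(2)                            # 8 - 1 = 7
variation_kbytes = BYTES_PER_PIXEL * ROW_PIXELS * row_difference // 1024  # 28
observed_kbytes = variation_kbytes + 1  # plus the round-up Kbyte from wrap-around

print(row_difference, variation_kbytes, observed_kbytes)
```

The result matches the spread between the smallest (12 Kbytes) and largest (41 Kbytes) values noted for Tables 6.8 through 6.11.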
6.3 Analysis of Results
In Tables 6.4, 6.5, 6.6, and 6.7 we observe the changes of csz and L95%, as we vary P
and nl, when template window is 2x2, 3x3, 6x6, and 9x9, respectively. In Tables 6.8, 6.9,
6.10, 6.11, 6.12, 6.13, and 6.14 we observe the changes of csz and L95%, for a particular
P-node twin-prefetching multiprocessor for all changes of the template window and nl.
Examining results in Tables 6.4, 6.5, 6.6, and 6.7 we observe the need for less
prefetching cache (L95%) as nl increases. There is also a greater need of prefetching
cache (L95%) as the number of processors (P) increases. It is also evident that the csz and
the L95% values approach each other (much faster from the csz side) for higher values of
nl and P. We also observe the need for a larger prefetching cache size (L95%) when
moving from one category of bus bandwidth to another, i.e. from double quotes to single
to no quotes. These observations are to be considered when designing a twin-prefetching
multiprocessor. It should be mentioned that the emphasis on the L95% values in our
analysis is due to cost attributes.
The most significant observation (Tables 6.4-6.14) is that L95% values, for all
changes of hardware or software parameters, fall within the range of 12 to 290 Kbytes.
In other words, any twin-prefetching multiprocessor system tested could obtain almost its
maximum performance using only a small prefetching cache.
Let us now focus on Tables 6.8, 6.9, 6.10, 6.11, 6.12, 6.13 and 6.14 where we
concentrate specifically on the cost-performance optimal prefetching cache size of a
particular P-node twin-prefetching multiprocessor. For every specific value of P and nl
(different shared-bus availability) four applications are executed and four different
prefetching cache sizes (L95%) are utilized. We select as optimal prefetching cache size
the largest of the four sizes. In other words, we select the prefetching cache size that
covers all applications executed on the system. It is important to note that we consider
optimal value the cost-performance optimal prefetching cache size value (L95%).
6.4 Optimal Selection of Prefetching Cache Size when P=1
When P=1 (Table 6.8) csz is in the range of Mbytes while the L95% value is always less
than 100 Kbytes. Respective L95% values for nl=32, 64, 128, 256, 512 are 65, 49, 41, 41,
41 Kbytes. The largest L95% prefetching cache size value (65 Kbytes) occurs when nl=32
and template window=9x9, and the smallest prefetching cache size (12 Kbytes) occurs
when nl=256 or nl=512 and template window=2x2. To be more precise, when nl=32 and
template window=9x9, 65Kbytes of prefetching cache provide 96.5% of the maximum
performance while when nl=256 and template window=2x2 the prefetching cache size
provides 99% of the maximum performance (timings for calculations of L95% values are
taken from Appendix C). The above example L95% values are presented in order to
demonstrate how close to the best performance is the at-least-95%-of-the-best-
performance.
Comparing Table 6.8 with Tables 6.9-6.14 we observe that the difference between
csz and L95% is greater for smaller values of P. We notice that for all changing values of
nl and template window, there is enough bus bandwidth (double quotes next to every
value in Table 6.8). (It is a little strange to talk about bus bandwidth availability
when P=1, but let us be reminded that there are two twin controllers utilizing the bus.)
We could finally say that a minimum size value of 65 Kbytes (a small memory
size in today's technology) covers all hardware and software requirements when P=1.
6.5 Optimal Selection of Prefetching Cache Size when P=2
When P=2 (Table 6.9) csz is in the range of Mbytes (except when the template
window=2x2 where csz is half a Mbyte) while L95% values are always less than 100
Kbytes. Respective L95% values for nl=32, 64, 128, 256, 512 are 65, 49, 41, 41, 41
Kbytes. The largest L95% prefetching cache size value (65 Kbytes) occurs when nl=32
and template window=9x9, and the smallest prefetching cache size (12 Kbytes) occurs
when nl=256 or nl=512 and template window=2x2.
We notice that for all changing values of nl and template window, there is enough
bus bandwidth (double quotes next to almost every value in the Table 6.9).
Finally, we could say that a minimum size value of 65 Kbytes covers all hardware
and software requirements when P=2.
6.6 Optimal Selection of Prefetching Cache Size when P=4
For P=4 (Table 6.10) csz takes values in the range 133Kbytes<csz<4Mbytes (4Mbytes are
used as the upper limit of csz because there is one case of no minimum convergence in
the Table) while no L95% value exceeds 137 Kbytes. For nl=32, 64, 128, 256, 512,
L95% values are 137, 68, 41, 41, 41 Kbytes respectively. The largest L95% prefetching
cache size value (137 Kbytes) occurs when nl=32 and template window=3x3, and the
smallest prefetching cache size (12 Kbytes) occurs when nl=256 or nl=512 and template
window=2x2.
When nl=32 and template window=2x2, the application falls in Case I (no quotes -
great need of bus bandwidth). For nl=32 & template window=3x3 and nl=64 & template
window=2x2, the application falls in Case II (single quotes - need of bus bandwidth). All
other hardware and software combinations fall into Case III.
We could finally say that when P=4, a minimum prefetching cache size value of
137 Kbytes covers all hardware and software requirements.
6.7 Optimal Selection of Prefetching Cache Size when P=8
Table 6.11 (P=8) provides csz values in the range 68Kbytes<csz<4Mbytes (4Mbytes are
used as the upper limit of csz because there is at least one case of no minimum
convergence in Table 6.11) while no L95% size exceeds 137 Kbytes. For
nl=32, 64, 128, 256, 512, L95% values are 137, 137, 68, 41, 41 Kbytes respectively. The
largest L95% prefetching cache size value (137 Kbytes) occurs when nl=32 and template
window=3x3, and the smallest prefetching cache size (12 Kbytes) occurs when nl=256 or
nl=512 and template window=2x2.
For nl=32, 64 and template window=2x2, 3x3 we observe some shortage of
shared-bus bandwidth while all other Table entries fall in Case III (shared-bus bandwidth
availability).
We could finally say that when P=8, a minimum prefetching cache size value of
137 Kbytes covers all hardware and software requirements.
6.8 Optimal Selection of Prefetching Cache Size when P=16
Table 6.12 (P=16) provides csz values in the range 36Kbytes≤csz≤4Mbytes (4Mbytes are
used as the upper limit of csz because there is at least one case of no minimum
convergence in Table 6.12) while no L95% size exceeds 149 Kbytes. For
nl=32, 64, 128, 256, 512, L95% values are 149, 137, 133, 68, 41 Kbytes respectively. The
largest L95% prefetching cache size value (149 Kbytes) occurs when nl=32 and template
window=6x6, and the smallest prefetching cache size (16 Kbytes) occurs when nl=512
and template window=3x3.
For nl=32, 64, 128 and template window=2x2, 3x3, 6x6 we observe some
shortage of shared-bus bandwidth while all other Table entries fall in Case III (shared-
bus bandwidth availability).
We could finally say that when P=16, a minimum prefetching cache size value of
149 Kbytes covers all hardware and software requirements.
6.9 Optimal Selection of Prefetching Cache Size when P=32
Table 6.13 (P=32) provides csz values in the range 40Kbytes<csz<4Mbytes (4Mbytes are
used as the upper limit of csz because there is at least one case of no minimum
convergence in Table 6.13) while no L95% size exceeds 277 Kbytes. For
nl=32, 64, 128, 256, 512, L95% values are 277, 149, 137, 133, 68 Kbytes respectively.
The largest L95% prefetching cache size value (277 Kbytes) occurs when nl=32 and
template window=6x6, and the smallest prefetching cache size (24 Kbytes) occurs when
nl=512 and template window=3x3.
A severe shortage of shared-bus bandwidth is observed when nl=32, 64, and a
milder shortage when nl=128, 256. Still, nl=512 is able to effectively support all
applications executed on the system.
We could finally say that when P=32, a minimum prefetching cache size value of
277 Kbytes covers all hardware and software requirements.
6.10 Optimal Selection of Prefetching Cache Size when P=64
Table 6.14 (P=64) provides csz values in the range 65Kbytes<csz<4Mbytes (4Mbytes are
used as the upper limit of csz because there is at least one case of no minimum
convergence in Table 6.14) while no L95% size exceeds 290 Kbytes. For
nl=32, 64, 128, 256, 512, L95% values are 290, 290, 277, 149, 133 Kbytes respectively.
The largest L95% prefetching cache size value (290 Kbytes) occurs when nl=32 and
template window=6x6, and the smallest prefetching cache size (41 Kbytes) occurs when
nl=512 and template window=9x9.
A severe shortage of shared-bus bandwidth is observed when nl=32, 64, 128 and
a milder shortage when nl=256. Still, nl=512 is able to effectively support applications
with template window 3x3, 6x6 and 9x9.
We could finally say that when P=64, a minimum prefetching cache size value of
290 Kbytes covers all hardware and software requirements.
6.11 Discussion of Results
Results show that by utilizing a small prefetching cache size (12 to 290 Kbytes) we
achieve at least 95% of the maximum performance of any P-node twin-prefetching
multiprocessor (P=1, 2, 4, 8, 16, 32, 64). Despite its low cost, this small amount of
prefetching cache, along with the twin-prefetching mechanism and a wider shared-bus,
makes possible the 100% utilization of all processors' bandwidth.
Furthermore, due to the range 12Kbytes<L95%<290Kbytes, we conclude that the
prefetching cache size is relatively insensitive to the amount of temporal and spatial
locality embedded in different applications. For the same reason the prefetching cache
size is relatively insensitive to shared-bus bandwidth and number of processors. The
insensitivity of the prefetching cache to the above parameters is extremely important for
the hardware implementation of a twin-prefetching system.
Table 6.15 shows the sizes of prefetching cache which should be utilized in the
implementation of a twin-prefetching multiprocessor with number of processors P and
shared-bus width nl. Every entry in Table 6.15 is the L95% value, which allows at least
95% of the best performance of all applications executed.
290 Kbytes (maximum amount of prefetching cache in Table 6.15) is a small
quantity of fast memory in today's advanced technology. Chapter 4 suggests that
configurations (P and nl) of twin-prefetching multiprocessor systems shown in Table
6.15, employing twin-prefetching caches greater than 150 Kbytes, should not be
implemented because of a lack of shared-bus bandwidth availability. Chapter 4 also
suggests that configurations of twin-prefetching multiprocessors summarized in Table
6.15, employing less than 100 Kbytes, are definitely worth building. It is our opinion that
an implementation of a twin-prefetching multiprocessor should carry double or triple the
amount shown in Table 6.15. The justification is that a small quantity, additional to the
value shown in Table 6.15, would cost very little and cover any unexpected application
needs.
All P-node twin-prefetching multiprocessor systems were tested with large, high-
resolution (eight 1024x1024) images. Due to the nature of the twin-prefetching mechanism,
the real-time execution of even a larger image would not require a larger prefetching
cache.
Table 6.4 Optimal prefetching cache size when template window=2x2
Table 6.5 Optimal prefetching cache size when template window=3x3
Table 6.6 Optimal prefetching cache size when template window=6x6
Table 6.7 Optimal prefetching cache size when template window=9x9
Table 6.8 Optimal prefetching cache size when P=1
Table 6.9 Optimal prefetching cache size when P=2
Table 6.10 Optimal prefetching cache size when P=4
Table 6.11 Optimal prefetching cache size when P=8
Table 6.12 Optimal prefetching cache size when P=16
Table 6.13 Optimal prefetching cache size when P=32
Table 6.14 Optimal prefetching cache size when P=64
Table 6.15 Optimal prefetching cache size (Kbytes)
P     nl=32   nl=64   nl=128   nl=256   nl=512
1     65      49      41       41       41
2     65      49      41       41       41
4     137     68      41       41       41
8     137     137     68       41       41
16    149     137     133      68       41
32    277     149     137      133      68
64    290     290     277      149      133
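For use in a design tool, Table 6.15 can be encoded as a simple lookup. The sketch below is illustrative only: the function names are hypothetical, and the row labels P=2 and P=32 are taken from the sequence of processor counts used throughout this work. The per-P maxima it prints agree with the minimum covering sizes stated in Sections 6.4 through 6.10.

```python
BUS_WIDTHS = (32, 64, 128, 256, 512)

# L95% prefetching cache size in Kbytes, indexed by P, one entry per bus width.
TABLE_6_15 = {
    1: (65, 49, 41, 41, 41),
    2: (65, 49, 41, 41, 41),
    4: (137, 68, 41, 41, 41),
    8: (137, 137, 68, 41, 41),
    16: (149, 137, 133, 68, 41),
    32: (277, 149, 137, 133, 68),
    64: (290, 290, 277, 149, 133),
}


def optimal_cache_kbytes(processors, bus_width):
    """L95% cache size for a given P and shared-bus-width nl."""
    return TABLE_6_15[processors][BUS_WIDTHS.index(bus_width)]


def cache_covering_all_bus_widths(processors):
    """Smallest cache size that covers every bus width tested for a given P."""
    return max(TABLE_6_15[processors])


for p in TABLE_6_15:
    print(f"P={p:2d}: {cache_covering_all_bus_widths(p)} Kbytes")
```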
CHAPTER 7
CONCLUSIONS
This dissertation introduces, investigates, and evaluates, through simulation, a low-cost
high-speed twin-prefetching DSP-based shared-bus shared-memory multiprocessor
system for real-time image processing applications. The proposed architecture can
effectively support 32 modern high-performance DSPs (ADSP-21060), sending
promising signals to the areas of shared-memory multiprocessors and real-time image
processing. It should be noted that the number of effectively supported processors (32) is
conservatively selected and it is based on the worst case scenario when the application
with the smallest number of local neighborhood operations (2D convolution with
template matrix 2x2) is executed. Applications employing a high number of local
neighborhood operations (e.g., 2D convolution with template matrix 9x9) achieve a
system efficiency greater than 90% even when 64 processors share the bus. The number
of neighborhood operations is a direct measure of the amount of data locality embedded
in a particular application and the above conclusions demonstrate the significant effect of
data locality on the system's performance. The positive response of both kinds of
applications on the proposed multiprocessor system demonstrates its potential to meet, in
a cost effective manner, the challenges not only of real-time digital image processing but
also of other computationally intensive applications.
The cost of the proposed system is kept low for the following reasons: (1)
single bus-based systems are the least costly; (2) DSPs cost considerably less than RISC
or CISC microprocessors; and (3) as this work concludes, the amount of fast (expensive)
memory needed for every prefetching cache is small (16 to 300 Kbytes).
In contrast to existing DSP-based multiprocessors, the proposed system sustains
peak performance regardless of image size due to the twin-prefetching mechanism. It
should be noted that in the applications run on recent DSP-based uniprocessor or
multiprocessor systems the majority of instructions are multifunction, each one including
a reference to data memory (where the image resides). It is necessary for the entire
section of the assigned image to be stored on fast memory (SRAM) in order to guarantee
single-cycle execution of all (multifunction) instructions; a basic approach for real-time
DSP-based systems. However, it is quite expensive storing one or a series of input and
output high resolution images on SRAM. It should be noted that even if the entire shared
memory consisted of SRAM in a conventional shared-bus multiprocessor system
(processor and bus having the same data width), still there would not be enough
bandwidth to support more than a handful of processors. Therefore, the twin-prefetching
caches placed between the processor and the shared bus serve two goals that benefit both
cost and performance. One is the feasibility of utilizing a wider bus which provides the
additional bandwidth required to effectively support a greater number of processors. The
other is a significant cost reduction since two inexpensive small-size fast memories
(prefetching caches) achieve the performance of a much larger and more expensive
SRAM. It should be noted that, in the proposed system, the image sections assigned to
each processor, whether large or small, are partitioned into smaller segments
which are interchangeably loaded on the Twin1 and Twin2 prefetching caches.
Eight consecutive high-resolution images of size 1024x1024 were mainly used to
investigate the processing power of the proposed system architecture. Our performance
analysis has demonstrated the real-time potential of this system. This is of paramount
importance in areas like digital video processing or robotic vision where processing
sequences of 25-30 images per second is commonplace. Below we provide some
typical results of processing times that different configurations of our system can achieve
in the case of convolution of 30 consecutive images (1024x1024) when a 3x3 template
matrix is used. When the number of processors P=64 and shared bus width nl=512,
execution time T=0.172 seconds; when P=32 and nl=512, T=0.245 seconds; when P=16
and nl=256, T=0.480 seconds; when P=8 and nl=256, T=0.940 seconds.
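To relate these timings to the 25-30 images per second mentioned above, the throughput of each configuration can be computed directly. This is a small sketch using only the four timings quoted in this chapter; the helper name is hypothetical.

```python
IMAGES = 30  # consecutive 1024x1024 images convolved with a 3x3 template

# (P, nl) -> execution time in seconds, as quoted in the text.
TIMINGS = {
    (64, 512): 0.172,
    (32, 512): 0.245,
    (16, 256): 0.480,
    (8, 256): 0.940,
}


def images_per_second(seconds, images=IMAGES):
    """Sustained throughput for a batch of images processed in `seconds`."""
    return images / seconds


for (p, nl), seconds in TIMINGS.items():
    print(f"P={p:2d}, nl={nl}: {images_per_second(seconds):.1f} images/s")
```

All four configurations process the 30 images in under one second, so each sustains more than 30 images per second for this application.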
Further research will be conducted to evaluate the performance of the proposed
multiprocessor system with additional digital image processing algorithms and neural
network simulation for pattern recognition and classification. A NUMA shared-memory
parallel computer formed by several clusters of the proposed system will also be
investigated for applications requiring even greater computing power.
APPENDIX A
ASSEMBLER CODE FOR 2D CONVOLUTION
Example of assembler code listing for two-dimensional Convolution of a segment of an
image (Mr x Nc) stored in prefetching cache, when template window=3x3:
/* The third coefficient is stored in the register file to free up a cycle to output a result.
The first write to the output buffer is unused; the pointer then wraps around to the proper
location at the start of output. */
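As a language-neutral reference for the computation performed on one cached segment, the following Python sketch evaluates a template over an Mr x Nc segment. It is not the author's SHARC assembler: it does not model the register file or the wrapping output-buffer pointer, the function name `convolve_segment` is hypothetical, and it assumes that only positions where the template fits entirely inside the segment are computed and that the template is applied without flipping (as in template matching).

```python
def convolve_segment(segment, template):
    """Weighted-sum filtering of a 2D segment with a small template.

    segment  : list of Mr rows, each a list of Nc pixel values
    template : list of rows of coefficients (e.g. 3x3)
    Returns a (Mr - tr + 1) x (Nc - tc + 1) result; border pixels are
    skipped (an assumption, since the border policy is not specified here).
    """
    mr, nc = len(segment), len(segment[0])
    tr, tc = len(template), len(template[0])
    out = []
    for r in range(mr - tr + 1):
        row = []
        for c in range(nc - tc + 1):
            acc = 0
            # multiply-accumulate over the template window
            for i in range(tr):
                for j in range(tc):
                    acc += segment[r + i][c + j] * template[i][j]
            row.append(acc)
        out.append(row)
    return out
```

With a 4x4 segment and a 3x3 template the result is a 2x2 array, one value per position where the window fits entirely inside the segment.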
APPENDIX B
EXECUTION TIME & SPEEDUP VS. SHARED-BUS-WIDTH
Template matrix =2x2 
Table B.1 Execution time vs. Shared-Bus-Width (n1) for P=1, 2, 4, 8, 16, 32 and 64
when template matrix=2x2




Figure B.2 Execution time vs. Shared-Bus-Width (n1) for P=8, 16, 32 and 64 when
template matrix=2x2
Table B.2 Speedup vs. Shared-Bus-Width (n1) for P=1, 2, 4, 8, 16, 32 and 64 when
template matrix=2x2
Figure B.3 Speedup vs. Shared-Bus-Width (n1) for P=1, 2, 4, 8, 16, 32 and 64 when
template matrix=2x2
Template matrix =3x3 
Table B.3 Execution time vs. Shared-Bus-Width (n1) for P=1, 2, 4, 8, 16, 32 and 64
when template matrix=3x3
Figure B.4 Execution time vs. Shared-Bus-Width (n1) for P=1, 2, 4, 8, 16, 32 and 64
when template matrix=3x3
Figure B.5 Execution time vs. Shared-Bus-Width (n1) for P=8, 16, 32 and 64 when
template matrix=3x3
Table B.4 Speedup vs. Shared-Bus-Width (n1) for P=1, 2, 4, 8, 16, 32 and 64 when
template matrix=3x3
Figure B.6 Speedup vs. Shared-Bus-Width (n1) for P=8, 16, 32 and 64 when template
matrix=3x3
Template matrix =6x6 
Table B.5 Execution time vs. Shared-Bus-Width (n1) for P=1, 2, 4, 8, 16, 32 and 64
when template matrix=6x6
Figure B.7 Execution time vs. Shared-Bus-Width (n1) for P=1, 2, 4, 8, 16, 32 and 64
when template matrix=6x6
Figure B.8 Execution time vs. Shared-Bus-Width (n1) for P=8, 16, 32 and 64 when
template matrix=6x6
Table B.6 Speedup vs. Shared-Bus-Width (n1) for P=1, 2, 4, 8, 16, 32 and 64 when
template matrix=6x6
Figure B.9 Speedup vs. Shared-Bus-Width (n1) for P=1, 2, 4, 8, 16, 32 and 64 when
template matrix=6x6
Template matrix =9x9 
Table B.7 Execution time vs. Shared-Bus-Width (n1) for P=1, 2, 4, 8, 16, 32 and 64
when template matrix=9x9
Figure B.10 Execution time vs. Shared-Bus-Width (n1) for P=1, 2, 4, 8, 16, 32 and 64
when template matrix=9x9
Figure B.11 Execution time vs. Shared-Bus-Width (n1) for P=8, 16, 32 and 64 when
template matrix=9x9
Table B.8 Speedup vs. Shared-Bus-Width (n1) for P=1, 2, 4, 8, 16, 32 and 64 when
template matrix=9x9




Appendix C provides results for all combinations of hardware and software tested. To
read the results, note that mr stands for the
number of rows of the template matrix, nc stands for the number of columns of the
template matrix, P stands for the number of processors, n1 stands for the shared-bus
width, Twr stands for Twratio, cacheusd stands for the size of the prefetching cache, T
stands for the execution time of the application on the P-node system, btlnck stands for












































1. Tien-Fu Chen and Jean-Loup Baer, "Effective Hardware-Based Data Prefetching for
High-Performance Processors," IEEE Transactions on Computers, Vol. 44, No. 5, pp,
609-623, May 1995.
2. David J. Lilja, " The Impact of Parallel Loop Scheduling Strategies on Prefetching in
a Shared Memory Multiprocessor," 1EEE Transactions on Parallel and Distributed
Processing, Vol. 5, No. 6, pp. 573-584, June 1994.
3. Todd Mowry and Anoop Gupta, "Tolerating Latency Through Software-Controlled
Prefetching in Shared-Memory Multiprocessors," Journal of Parallel and Distributed
Computing, Vol. 12, pp. 87-106, 1991.
4. Fredrik Dahlgren, Michel Dubois, and Per Stenstrom, "Sequential Hardware
Prefetching in Shared-Memory Multiprocessors," IEEE Transactions on Parallel and
Distributed Processing, Vol. 6, No. 7, pp. 733-746, July 1995.
5. Chung-Ho Chen and Arun K. Somani, "Effects of Cache Traffic on Shared Bus
Multiprocessor Systems," international Conference on Parallel Processing, pp. 285-
288, 1992.
6. S. Chandra, J. R. Larus, and A. Rogers, "What is Time Spent in Message-Passing and
Shared-Memory Programs?," 6th International Conference on Architectural Support
for Programs Langs. and OS's, October 1994.
7. Shin-ichiro Mori, Hiheki Saito, Masahiro Goshima, Mamoru Yanagihara, Takashi
Tanaka, David Fraser, Kazuki Joe, Hiroyuki Nitta, and Shihji Tomita, "A Distributed
Shared Memory Multiprocessor: ASURA - Memory and Cache Architectures,"
Proceedings of Supercomputing, Portland- Oregon, pp. 740-749, November 1993.
8, Norman P. Jouppi, "Improving Direct-Mapped Cache Performance by the Addition of
a Small Fully-Associative and Prefetch Buffers," Computer Architecture News, pp.
364-373, January 1990.
9. Jean-Loup. Baer and T.-F. Chen, "An Effective On-Chip Preloading Scheme to
Reduce Data Access Penalty," Proceedings of Supercomputing'91, pp. 176-186,
November 1991.
10. Ingrid Y. Bucher and Donald A. Calahan, "Models of Access Delays in
Multiprocessor Memories," IEEE Transactions on Parallel and Distributed Systems,
Vol. 3, No. 3, pp. 270-280, May 1992.
159
160
11. John K. Bennett, Sandhya Dwarkadas, Jay Greenwood, and Evan Speight, "Willow:
A Scalable Shared Memory Multiprocessors," Proceedings of Supercomputing'92,
pp. 336-345, 1992.
12. Nayeem Islam and Roy H. Campbell, "Design Considerations for Shared Memory
Multiprocessor Message Systems," IEEE Transactions on Parallel and Distributed
Systems, Vol. 3, No. 6, pp. 702-711, November 1992.
13. Arun Nanda and Lionel M. Ni, "MAD Kernels: An Experimental Testbed to Study
Multiprocessor Memory System Behavior," International Conference on Parallel
Processing, pp. 28-35, 1992.
14. Daniel Lenoski, James Laudon, Kourosh Gharachorloo, Wolf-Dietrich weber, Anoop
Gupta, John Hennessy, Mark Horowitz, and Monica S. Lam, "The Stanford Dash
Multiprocessor," IEEE Computer, pp. 63-79, March 1992.
15. Analog Devices, Norwood, MA, ADSP-2106X SHARCTM User's Manual, 1997.
16. Analog Devices, Norwood, MA, Digital Signal Processing in VLSI, 1990.
17. Andrew S. Tanenbaum, M. Frans Kaashoek, Henri E. Bal, "Parallel Programming
Using Shared Objects and Broadcasting," IEEE Computer, pp. 10-19, November
1992.
18. K. Rajan, K. S. Sangunni, and J. Ramakrishna, "Dual-DSP System for Signal and
Image Processing," Microprocessors and Microsystems, Vol. 17, No. 9, pp. 556-560,
November 1993.
19. Betty Prince, "Memory in the Fast Lane," IEEE Specrtrum, pp. 38-41, February 1994.
20. Oxford Micro Devices Inc., "A236 Parallel Video Digital Signal Processor Chip,"
Data Sheet Summary of Oxford Micro Devices Inc., June 1997.
21. Dave Bursky, "Parallelism Pushes DSP Throughput," Electronic Design, pp. 151-
154, March 1994.
22. Dave Bursky, "Parallel-Processing DSP Chip Delivers Top Speed," Electronic
Design, pp. 43-49, October 1991
23, R. C. Gonzales and P. Wintz, Digital Image Processing, Addison-Wesley, 1987.
24. P. Danielson and S. Levialdi, "Computer Architectures for Pictorial Information
Systems," IEEE Computer, pp. 53-57, November 1981.
161
25. Yasuyuki Okumura, Kazunari Irie, and Ryozo Kishimoto, "Multiprocessor DSP with
Multistage Switching Network for Video Coding," IEEE Transactions on
Communications, Vol. 39, No. 6, pp. 938-946, June 1991.
26. Srinath V. Ramaswamy and Gerald D. Miller, "Multiprocessor DSP Architectures
that Implement the FCT Based JPEG Still Picture Image Compression Algorithm
with Arithmetic Coding," IEEE Transactions on Consumer Electronics, pp. 1-5,
October 1992.
27. Marc Duranton, "Image Processing by Neural Networks," IEEE Micro, pp. 12-19,
October 1996.
28. Erick Hagersten, Anders Landin, and Seif Haridi, "DDM - A Cache-Only Memory
Architecture," IEEE Computer, pp. 44-54, September 1992.
29. John Brian, "DSP: The Secret is Out," PC Computing, April 1994.
30. Roland L. Lee, Pen-Chung Yew, and Duncan H. Lawrie, "Data Prefetching in Shared
Memory Multiprocessors," Proceedings of the International Conference on Parallel
Processing, pp. 28-31, 1987.
31. Vipin Kumar, Ananth Grama, Anshul Gupta, and George Karypis, Introduction to
Parallel Computing, Benjamin Cummings, Redwood City, California, 1994.
32. Kai Hwang, Advanced Computer Architecture with Parallei Programming, McGraw
Hill, Englewood Cliffs, NJ, 1993.
33. Ted G. Lewis, "Where is Computing Headed," IEEE Computer, Vol. 27, No. 8, pp.
59-63 August 1994.
34. Arvind, "Prospects of Ubiquitous Parallel Computing," Proceedings of Parallel
Processing Symposium, 8th Int'l (ICPP 94), IEEE CS Press, Los Alamitos,
California, Order No. 5602-02U, pp. 2, April 1994.
35. Ted G. Lewis, "Supercomputers Ain't so Super," IEEE Computer, pp. 5-8, November
1994.
36. Per Stenstrom, "A Survey of Cache Coherence Scheme for Multiprocessors," IEEE
Computer, Vol. 23, No. 6, pp. 12-24, June 1990.
37. Kevin Dowd, High Performance Computing, O'Reilly&Associates Inc., California,
1993.
162
38. Chang J. H., Ibarra 0., Pong T.C., and Sohn S., "Convolution on a Pyramid
Computer," Proceedings of the International Conference on Parallel Processing, pp.
780-782, 1987
39. Ranka S. and Sanhi S., "Convolution on SIMD Mesh Connected Multicomputers,"
Proceedings of the International Conference on Parallel Processing, pp. 212-216,
1987
40. Prasanna Kumar V. and Krishnan V., "Efficient Image Template Matching on SIMD
Hypercube Machines," Proceedings of the International Conference on Parallel
Processing, pp. 765-771, 1987
41. Ranka S. and Sanhi S., "Image Template Matching on MIMD Hypercube
Multicomputer," Proceedings of the International Conference on Parallel
Processing, pp. 92-99, 1988
