Processor Topologies for Image Processing Applications by Wong, Kit Sai
PROCESSOR TOPOLOGIES FOR IMAGE PROCESSING
APPLICATIONS
KIT SAI WONG
Bachelor of Engineering
National University of Singapore
Republic of Singapore
1985
Submitted to the Faculty of the
Graduate College of the
Oklahoma State University
in partial fulfillment of
the requirements for
the Degree of
MASTER OF SCIENCE
May, 1997
PROCESSOR TOPOLOGIES FOR IMAGE PROCESSING
APPLICATIONS
Thesis Approved:
Dean of the Graduate College
PREFACE
Arabnia's process-and-data-decomposition approach can be
applied to many image processing jobs other than image rotation. I
have developed a generalized topology for image processing, and I
investigate its efficiency and speed-up and the factors that affect
its performance. The generalized topology has a drawback: in most
cases it is not load balanced. I therefore develop a linked-trees
topology with better load balancing capability, investigate its
efficiency and speed-up, and analyze the factors affecting them. In
conclusion, the linked-trees topology is superior to the generalized
topology in time efficiency and speed-up.
ACKNOWLEDGMENT
First and foremost, I would like to express my sincere
appreciation to my M.S. thesis advisor, Dr. K. M. George, for his
consistent and continuous advice, guidance, kindness, and valuable
instruction throughout my graduate work.
Special thanks are due to my advisory committee members,
Dr. J. P. Chandler and Dr. H. Lu, for their advice, support, and
valuable comments on my research.
I would like to take this opportunity to thank my parents,
Wong Man Sum and Pong Yip Hoi, who have provided consistent support
and endless love throughout my life. I would also like to thank my
sisters, Wong Sai Wah and Wong Sai Yuet, for their encouragement and
care.
TABLE OF CONTENTS
Chapter
I. INTRODUCTION
1.1 Image Processing and Parallel Processing
1.2 Scope and Objectives
1.2.1 Objective 1
1.2.2 Objective 2
1.3 Organization of the Thesis
II. RELATED WORKS
2.1 Machine Vision Quality Inspection System for Textile
Industries
2.2 Evaluation of Transputer-based Architecture for Image
Compression and Reconstruction
2.3 Image Processing on a Transputer-based Perfect Shuffle
Machine
2.4 Distributing Pixmap Images Among Parallel Disk Arrays
2.5 Arbitrary Rotation of Digitized Images Using
Process-and-Data-Decomposition Approach
2.6 Image Processing Methods
III. TRANSPUTER NETWORK
3.1 Introduction to Transputer
3.2 Generalization of the Process-and-Data-Decomposition
Approach
3.3 Factors Likely to Affect Speed-up and Time Efficiency
3.4 Experiment
3.4.1 A Simple Implementation of Farming or SIMD
Topology
3.4.2 Execution Time of the Algorithms
3.4.3 Pipelined Implementation of Three
Processes
3.5 The Types of Pipelining
3.6 Linked-Trees Topology
3.6.1 Data Partitioning in Linked-Trees Topology
3.6.2 Illustration by Examples
3.7 Results
3.7.1 A Simple Implementation of Farming or SIMD
Topology
3.7.2 Pipelining of Images and Pipelining of
Segments
3.7.3 Linked-Trees Topology Experiments
IV. CONCLUSION
BIBLIOGRAPHY
LIST OF FIGURES
Figure
1. The transputer network for an image space divided into four
blocks
2. Generalized topology for process-and-data-decomposition approach
using transputers for parallel image processing
3. Arabnia's two rings network is a version of the generalized
topology with two levels and four nodes in each level
4. Transputer chain network used for the speed-up and time
efficiency experiments
5. The data and operation flow of the three nodes chain network
6. A 3 x 3 mesh network using process-and-data-decomposition
approach
7. Data and operations flow of pipelining of images
8. Data and operations flow of pipelining of segments
9. (a) Image pipelining.
(b) Segment pipeline timing
10. Linked-trees topology
11. Experiment results by executing various image tasks on a
farming topology
12. The graphs for experiment results
13. Results and calculations for pipelining of images and
pipelining of segments
14. Results and calculations for pipelining through the
linked-trees topology
15. The results of the three linked-trees when they are processed
individually
16. (a) The effect of number of nodes in the linked-tree for
blurring on the efficiency of the whole linked-trees
topology
(b) Effect of the number of nodes in the linked-tree for
blurring on the speed-up ratio of the whole linked-trees
topology
17. Comparison of network efficiencies and number of transputers
used for generalized topology and linked-trees topology
accomplishing the same speed-up ratio
CHAPTER I
INTRODUCTION
1.1 Image processing and parallel processing
Serial processing on a digital computer is probably the
most common method of implementing pattern/picture processing
operations even today. For many image processing applications,
pixel-by-pixel processing is necessary. For example, if we are
processing an image of 512 x 512 pixels, the same operation will
have to be repeated about 262,000 times (512 x 512 = 262,144). So a
serial computer is not the most time-efficient processor,
considering the inherently iterative nature of many image
processing operations, particularly when time constraints are
important. Furthermore, serial processing is certainly not like the
biological systems of human beings, such as the eyes and brain
[Fairhurst88].
Depending on the type of image processing job, it may take
minutes to process an image using serial processing, which is not
suitable for many real-time applications. On the other hand,
parallel processing provides a more effective correspondence with
the inherent parallelism in many tasks of interest in image
processing. Great improvement in computing power and shortening of
processing time can be achieved by parallel processing.
The above observations have led to the
exploration of parallel processing architectures for image
processing [Uhr87], [Tomohisu93], [Morrow91], [Crookes89]. The
different computer architecture models used for parallel image data
processing are the SIMD and MIMD architectures. These models fall
into Flynn's taxonomy of computer architectures [Dasgupta89]. This
thesis proposes new topologies for transputer-based networks
suitable for image processing. All the research presented in this
thesis is based on transputer networks.
1.2 Scope and objectives
1.2.1 Objective 1
The first objective of this study is to develop a
generalized topology for image processing based on Arabnia's
process-and-data-decomposition approach [Arabnia90] for image
transformation, which is described in chapter II. In the
generalized topology (described later), as in Arabnia's network,
different segments of the image are computed in parallel and
different subprocesses of a process are executed in parallel.
1.2.2 Objective 2
As outlined later in chapter II, Arabnia presents a network
consisting of two rings of transputers [Arabnia90]. He also
presents a parallel rotation algorithm. In Arabnia's rotation
algorithm, the inner ring processes are less time consuming than
the outer ring processes, so the load on the two rings is
not balanced. Arabnia suggests that it is not necessary to have
transputers of the same processing power in both rings. For
example, he suggests that 20 MIPS transputers could be used to
execute the outer ring processes and 10 MIPS transputers could be
used to execute the inner ring processes to achieve load balancing,
so that the inner ring will not wait idle for the outer ring to
complete its job.
But this is not an ideal solution to the network's load
balancing problem, because a computer engineer may have only one
type of transputer available. Also, load balancing is intended
to increase the efficiency of a transputer network, so deliberately
slowing down one component to 10 MIPS to accomplish it is
not ideal. Since the generalized topology is developed from
Arabnia's topology, it also inherits the load balancing problem.
So, a second objective of this study is to search for and
propose an alternative solution for balancing the load of different
processes and subprocesses in the generalized transputer network
which uses the process-and-data-decomposition approach for image
processing. We use a sequence of processes on an image as an
example to illustrate the effectiveness of the proposed network
with load balance. In addition, the speed-up and time efficiency of
the proposed network are investigated and compared with
those of the unbalanced generalized topology to show the advantages
of having a load-balanced network in terms of speed-up and time
efficiency.
1.3 Organization of the thesis
The remainder of this thesis is organized as follows.
Chapter II describes related work: machine vision quality
inspection by Karkanis, image compression and reconstruction by
Antola, the process-and-data-decomposition approach by Arabnia, and
other related works. Chapter III gives an introduction to
transputers and generalizes the process-and-data-decomposition
approach of Arabnia. It then discusses
factors likely to affect the time efficiency and speed-up of the
generalized topology, and investigates the efficiency and speed-up
of the generalized topology by means of experiments. It introduces
the linked-trees topology to solve the load balancing problem of the
generalized topology, and investigates the speed-up, time efficiency
and load balance of the linked-trees topology. The results of the
experiments are then given. Chapter IV summarizes the results of the
investigations and concludes the thesis.
CHAPTER II
RELATED WORKS
In this chapter, we examine the works related to the
research undertaken in this thesis.
2.1 Machine vision quality inspection system
for textile industries
Karkanis presents a machine-vision-based quality inspection
system which can be applied to textile inspection. The objective
is to increase the inspection speed and improve the quality
assurance performance. It is implemented using a transputer network
[Karkanis89].
The image processing algorithms were implemented on a
parallel architecture allowing concurrent processing of different
parts of an image at the same time with the same set of
instructions. In other words, it is a SIMD architecture
[Dasgupta89].
Various topologies containing four or nine transputers in
chain and tree networks have been tried.
2.2 Evaluation of transputer-based architecture for
image compression and reconstruction
Two different methods of obtaining parallelism are
evaluated for the execution of the 2D-DCT (two-dimensional Discrete
Cosine Transform). The first is the decomposition of the
algorithm into a number of steps, each executed by a different
transputer simultaneously (algorithm parallelism). The second
is the decomposition of data into blocks to be transformed in
parallel. The latter architecture is also called farming processing
[Antola91].
2.3 Image processing on a transputer-based perfect
shuffle machine
A transputer-based parallel computer has been developed for
image processing. This is a multiple instruction multiple
data (MIMD) architecture. The computer is built using a
one-dimensional array of processing nodes with nearest neighbor
connections and a perfect shuffle network. Several examples are used
to demonstrate that this machine is capable of efficiently solving
the typical computational tasks of low- and medium-level image
processing [Schomberg89].
2.4 Distributing pixmap images among parallel disk
arrays
Professionals in various scientific fields (for example,
medical imaging and civil engineering) require rapid access to very
large amounts of pixmap image data. Browsing through large pixmap
images requires segmentation of the image into rectangular areas,
which can be retrieved from disks or cache on demand. A server
architecture that consists of four image-handling processors and
eight disk nodes is proposed in [Hersch93]. It is reported that
the server architecture is cost-effective.
2.5 Arbitrary rotation of digitized images
using process-and-data-decomposition approach
Arabnia [Arabnia90] presents a process-and-data-decomposition
approach using transputers for rotation of digitized
images. This approach can be applied not only to image
rotation, but also to many other image processing jobs such as
Laplace edge enhancement [Lindley91]. Arabnia's
process-and-data-decomposition approach is outlined below:
(a) Process decomposition - the decomposition of a process
into a number of subprocesses and the mapping of each subprocess to
a processor for execution. Here, a set of concurrent processes
operate simultaneously and cooperatively to solve a given problem.
This approach is equivalent to the MIMD machine architecture
[Dasgupta89].
(b) Data decomposition - the decomposition of data into
smaller portions (not necessarily equal) and the mapping
of each portion of data to a processor for execution. This approach
is used in both MIMD and SIMD machine architectures [Dasgupta89].
(c) Process and data decomposition - this approach can be
regarded as the combination of decompositions (a) and (b).
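The data decomposition in (b) can be illustrated with a minimal sketch. It assumes the image is divided by rows into contiguous portions, one per processor; the function name split_rows and the near-equal split are illustrative assumptions, not Arabnia's partitioning (which divides the image into four blocks).

```python
def split_rows(height, parts):
    """Split `height` image rows into `parts` contiguous (start, stop) ranges.

    Hypothetical helper: portions need not be equal, so the first
    `height % parts` portions simply receive one extra row.
    """
    if parts < 1:
        raise ValueError("parts must be at least 1")
    base, extra = divmod(height, parts)
    ranges = []
    start = 0
    for i in range(parts):
        stop = start + base + (1 if i < extra else 0)
        ranges.append((start, stop))  # rows start..stop-1 go to processor i
        start = stop
    return ranges
```

Each (start, stop) pair would be mapped to one processor; process decomposition (a) would then assign a different subprocess to each group of processors.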
Arabnia uses a transputer network partitioned into two
rings as shown in Fig. 1(b). The image is divided into four blocks,
each assigned to a transputer node in the outer ring for processing
as shown in Fig. 1(a), (b). The algorithm for rotation can be
divided into two major processes, one for each of the two rings of
transputers. These two processes are executed simultaneously.
Thus both the data portions and the two major
processes are handled simultaneously.
It is interesting to note that Antola also uses the
decomposition of processes and the decomposition of data, but
separately [Antola91]. Arabnia, in contrast, uses a
combination of both methods to achieve greater parallelism. This
thesis is based on the process-and-data-decomposition approach of
Arabnia.
2.6 Image processing methods
In this section, the author briefly introduces the image
enhancement methods used in the experiments conducted as part of
this research. The methods described are Laplace edge enhancement,
image smoothing, image thresholding, and image blurring.
Laplace Edge Enhancement - local, or neighborhood,
equalization of image contrast produces an increase in local
contrast at boundaries. This has the effect of making edges easier
for the viewer to see, consequently making the image appear sharper
[Phillips94].
Image Smoothing - in a 3 x 3 pixel area, all pixel grey
scale values are averaged to produce the resulting value for the
pixel at the center of the area. This is done for every pixel in
the whole image [Russ92].
Image Thresholding - a range of brightness values is defined
by means of a threshold value. The pixels within this range are
selected as belonging to the foreground, and all the other pixels
are assigned to the background. Such an image is then usually
displayed as a binary or two-level image [Russ92].
Image Blurring - in a 3 x 3 pixel area, all pixel values are
changed to the same value as the pixel at the center of the area.
This is done for all the 3 x 3 pixel areas in the image [Russ92].
The image used in the experiments in this thesis is an
image of a book cover.
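The three methods used in the experiments can be sketched serially as follows. This is a minimal illustration, not the thesis's implementation (which is written in C and adapted from the cited sources). It assumes a grey-scale image stored as a list of rows of integers, leaves border pixels unchanged in smoothing, and treats the 3 x 3 areas in blurring as non-overlapping tiles; all three choices are assumptions.

```python
def threshold(image, low, high):
    """Pixels with low <= value <= high become foreground (1), others 0."""
    return [[1 if low <= p <= high else 0 for p in row] for row in image]


def smooth(image):
    """3 x 3 mean filter; border pixels are copied unchanged (assumption)."""
    h, w = len(image), len(image[0])
    out = [row[:] for row in image]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            total = sum(image[y + dy][x + dx]
                        for dy in (-1, 0, 1) for dx in (-1, 0, 1))
            out[y][x] = total // 9  # integer average of the 9 neighbours
    return out


def blur(image):
    """Set every complete, non-overlapping 3 x 3 tile to its center value."""
    h, w = len(image), len(image[0])
    out = [row[:] for row in image]
    for y in range(0, h - 2, 3):
        for x in range(0, w - 2, 3):
            center = image[y + 1][x + 1]
            for dy in range(3):
                for dx in range(3):
                    out[y + dy][x + dx] = center
    return out
```

Thresholding is a point operation, while smoothing and blurring are area operations that need neighbouring pixels, which matters once the image is split across processors.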
CHAPTER III
TRANSPUTER NETWORK
3.1 Introduction to transputer
The transputer gets its name by combining the words
transistor and computer. This device was developed by INMOS Ltd of
Britain, and it is based on the multiple-instruction multiple-data
(MIMD) loosely coupled architecture [Dasgupta89]. This family of
16- and 32-bit microprocessors can be used in single-processor
applications or linked together in a network to form a
multiprocessor system. Each transputer contains a CPU, on-chip
static RAM, timers, an external memory interface, and four
high-speed serial links which allow communications between
processing nodes [CSA90c]. The T800 transputer series also has an
on-chip floating point unit (math coprocessor) [Im90], [Hull94].
It has been demonstrated that the transputer is a powerful
and flexible device for building large multiple-processor parallel
systems [Leung90], [Mohan90], [Pachowicz89], [Kirland91],
[Gupta93], [Gray91], [Phillips94], [Stallard93], [Lakshmi90],
[Stalker91], [EET95]. The T800 transputer delivers a 4 Mega
Whetstone benchmark performance and a sustained capability of 1.5
MFlops, and includes four 10/20 Mbits/sec INMOS serial
communication links. In addition to their 4 Kbytes of fast on-chip
RAM, the T400s and T800s can directly access a linear address space
of up to 4 Gbytes of local memory at a data rate of up to 50
Mbytes/sec [CSA90c].
There are a number of compilers and development systems
available for the use of most high-level programming languages in
transputer programming. Those languages include Ada, C, FORTRAN,
MODULA-2, OCCAM (the native language of transputers), PASCAL,
CS_PROLOG, and T-CODE, which is the transputer equivalent of
assembly language [CSA90b], [CSA90a], [Kerridge94].
3.2 Generalization of the process-and-data-decomposition
approach
To generalize the process-and-data-decomposition approach
in image rotation [Arabnia90] for application to other image
processing jobs, the author proposes a generalized topology. The
organization of transputers in the generalized topology is shown in
Fig. 2. The generalized topology is based on the architecture
taxonomies and pipeline processing. Close observation reveals that
the inner ring and outer ring topology of Arabnia is a special case
of the generalized topology. This is explained after
the generalized topology has been introduced.
In the generalized topology, the transputers are connected
in a mesh form. A row of transputers constitutes a level. In each
level, each transputer is connected to its left and right
neighbors, and each transputer is connected to the corresponding
transputer in the next level. The first transputer in the first
level serves as a PC/Link connection to the host PC. The last
transputer in each level is connected to the first transputer in
the same level. The first transputer in the last level is connected
to the first transputer in the first level. Double lines represent
connections for both data communication and system services (e.g.
program loading). Single lines represent connections for data
communication only. So, except in the case of the leftmost nodes,
the links between levels are used for data transfer only.
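The connection rules above can be written down as a small sketch that enumerates the links of a generalized topology. The function name, the 0-indexed (level, node) labels, and the use of undirected links are illustrative assumptions; the PC/Link to the host and the distinction between double and single lines are not modelled.

```python
def build_generalized_topology(levels, nodes_per_level):
    """Return the set of undirected links between (level, node) pairs."""
    links = set()

    def connect(a, b):
        if a != b:  # a one-node level has no neighbour link
            links.add(frozenset((a, b)))

    for lv in range(levels):
        for n in range(nodes_per_level):
            # Left/right neighbours; the modulo also links the last
            # transputer of a level back to the first one.
            connect((lv, n), (lv, (n + 1) % nodes_per_level))
            # Corresponding transputer in the next level.
            if lv + 1 < levels:
                connect((lv, n), (lv + 1, n))
    # First transputer of the last level to the first of the first level.
    connect((levels - 1, 0), (0, 0))
    return links
```

With two levels of four nodes the closing link coincides with an existing inter-level link, so the set holds 12 distinct links; a 3 x 3 arrangement yields 16.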
In the generalized topology, the image can be split into
portions and fed to the first level of processors, all of which
perform the same processing instructions. The processed pixels
are then channeled to the second level of processors, all of which
execute the second set of processing instructions. In general, the
pixels processed at the i-th level are passed to the (i+1)-th level
of processors. All the levels together form a pipeline. Each level
corresponds to a stage of the pipeline. Each level is a farming
topology [Antola91].
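The flow of pixels through the levels can be simulated serially with a short sketch. It assumes per-pixel (point) operations and a split by rows, and it runs the nodes one after another rather than in parallel; the function name and these simplifications are assumptions made for illustration only.

```python
def run_generalized_topology(image, stage_ops, nodes_per_level):
    """Apply one per-pixel operation per level, farming rows out per level."""
    rows = image
    for op in stage_ops:  # each operation corresponds to one level (stage)
        n = len(rows)
        # Split the rows into nodes_per_level contiguous portions.
        bounds = [n * k // nodes_per_level for k in range(nodes_per_level + 1)]
        portions = [rows[bounds[k]:bounds[k + 1]] for k in range(nodes_per_level)]
        # Every node in the level runs the same instructions on its portion.
        processed = [[[op(p) for p in row] for row in part] for part in portions]
        # Reassemble and pass the result on to level i + 1.
        rows = [row for part in processed for row in part]
    return rows
```

For example, passing two operations and nodes_per_level=2 models a two-level network in which each level is a two-node farm.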
This generalized topology is suitable not only for the
decomposition of a process into a number of subprocesses, but also
for image processing jobs involving multiple processes,
which corresponds to a MIMD structure [Dasgupta89]. That means each
level of processors can represent a different process necessary for
the enhancement of an image.
Arabnia's topology in Fig. 1(b) is a special case of the
generalized topology with two levels and four nodes in each level.
The I/O function is performed by one of the nodes in the outer
ring, as shown in Fig. 3.
Every node in the network receives the whole image
before computation starts. This is because of the computation
problem that occurs at section boundaries, where a neighborhood is
physically split across two transputers, and also because the
processes at different levels may be different area
operations.
This study is based on this generalized topology.
However, the generalized topology has drawbacks in load balancing.
An improved topology for process-and-data-decomposition using
linked-trees is proposed later.
In section 3.4 of this chapter we describe the details
of experiments to estimate the speed-up and time efficiency of the
proposed generalized transputer network used for low-level image
processing jobs using the process-and-data-decomposition approach. In
section 3.6 of this chapter we describe experiments with a proposed
new transputer topology, called the linked-trees topology. We
illustrate that it is a better load-balanced topology for a
transputer network used in image processing than the
generalized process-and-data-decomposition topology described
earlier. It is a better topology than the generalized topology if
load balancing in a transputer network is critical to its
performance. But it also has the disadvantage that it is good only
for pipelining of images, not for pipelining of segments.
These two types of pipelining are described in section 3.5. The
speed-up and time efficiency of the linked-trees topology are then
investigated.
Fig. 1 The transputer network for an image
space divided into four blocks. (a) An image
space divided into four blocks (Block 1 to Block 4). (b) The network
of four blocks [adapted from Arabnia90].
Fig. 2 Generalized topology for process-and-data-decomposition
approach using transputers for parallel image processing, showing
the root transputer with its PC/Link and Levels 1 to 5 of nodes.
(Double lines are links for both communications and
system services; single lines are for communications only.)
Fig. 3 Arabnia's two rings network (first level, second level,
PC host). This is a version
of the generalized topology with two levels and
four nodes in each level. Arabnia's image
rotation algorithm can be implemented by the
above topology.
3.3 Factors likely to affect speed-up and time
efficiency
In a parallel transputer network, the overall execution
time of the network consists of the computation times and the
communication times among the transputers through their
interconnections. In many applications, such as robotic trajectory
calculations [Mckeever92], [Zomaya92], internode communications
involve only a few bytes of data. But in image processing,
pixel values are often communicated data block by
data block, with thousands of bytes in each block. A typical
block size is 100 x 500 pixels. Take a SIMD image processing
network as an example. First, the image needs to be split up and
transmitted to each slave transputer of the SIMD or farming network
[Antola91]. After transformation, the transformed image of each
slave transputer needs to be transmitted back to the master
transputer for storage or output. So the communication of data is
very intensive in an image processing job.
The purpose of a parallel network for image processing is to
reduce the computation load of each node so as to reduce the
overall execution time. But if the communication times among nodes
are large, they form a bottleneck for the overall execution
time. That means that as the number of nodes in a transputer
network is increased, beyond a certain number of nodes the overall
execution time can no longer be reduced because of this
bottleneck. Thus the speed-up and time efficiency of the
transputer network are affected.
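The bottleneck argument can be made concrete with a toy cost model. The model below is an assumption introduced only for illustration and is not taken from the thesis: computation time divides evenly over n nodes, while communication time grows linearly with the number of additional nodes. The function names and parameter values are hypothetical.

```python
def execution_time(n, t_comp, t_comm):
    """Modelled overall time on n nodes (assumed model, arbitrary units)."""
    return t_comp / n + t_comm * (n - 1)


def best_node_count(max_nodes, t_comp, t_comm):
    """Smallest n in 1..max_nodes that minimises the modelled time."""
    return min(range(1, max_nodes + 1),
               key=lambda n: execution_time(n, t_comp, t_comm))
```

Under this model, with t_comp = 16 and t_comm = 1 the time stops improving after four nodes, whereas with negligible communication cost (t_comm = 0) adding nodes keeps helping.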
But if the computation is CPU intensive, the effect of
communication time on the speed-up and time efficiency is
reduced. That is why, in our experiments, image
operations of different computational intensity are chosen
in order to gain a better understanding of the speed-up and time
efficiency problem in a transputer network for image processing.
The load balancing of the transputer network also affects
the speed-up and time efficiency of the network. For example, if
the workloads of the nodes in a pipelining network are not balanced,
some nodes will have idle times waiting for the nodes with slower
processes to finish. A transputer network with a well-balanced load
will not have such a problem. The experiment conducted in this
study for performance analysis of the generalized topology is
described in the following section. The algorithms are implemented
in the C language.
3.4 Experiment
Three processes are executed independently of each other
using only one level of the generalized topology. Then these three
processes are executed in three levels of the generalized topology in
a pipelining mode. The speed-up and time efficiency are found for
both cases.
3.4.1 A simple implementation of farming or SIMD topology
The first part of the experiment is based on the
generalized process-and-data-decomposition topology described
earlier. The experiment involves only one image process at a time;
it is not a multiprocess operation. So only the first level of the
generalized topology is used. The topology for the experiment is
shown in Fig. 4, which is actually a farming topology. But the
assumptions and description of the generalized topology still apply
here. The network in Fig. 4 can have different numbers of nodes. In
our experiment, we execute three algorithms in a transputer chain
network of one, two, and three transputers. The
algorithms are image thresholding, image smoothing, and image
blurring. They are adapted from [Lindley91], [Duff86].
Fig. 5 shows the data and operations flow in the three
nodes network.
3.4.2 Execution time of the algorithms
The algorithms in these three nodes can be split into three
parts: the receiving and transmitting of the initial image, the
computation, and the receiving and transmitting of the transformed
image. The execution times for these three parts are represented by
T1, T2 and T3 respectively. The file reading and writing times are
ignored, because they vary according to the I/O hardware used. For
example, different hard disks have different access times, which in
turn affect the file reading and writing time. Therefore, we define
total execution time as:
Total execution time = T1 + T2 + T3
Only the master node algorithm is timed. It is the master
node which outputs the final transformed image to the user, so the
overall job is not done until the master has finished its job.
The resulting times are used to calculate the speed-up
and time efficiency of the transputer network.
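As a sketch of how such measurements are typically turned into performance figures, the helpers below assume the conventional definitions: speed-up is the one-node time divided by the n-node time, and efficiency is speed-up divided by n. These definitions and the function names are assumptions, and the numbers in the usage note are illustrative, not measured results.

```python
def total_execution_time(t1, t2, t3):
    """Initial-image transfer + computation + transformed-image transfer."""
    return t1 + t2 + t3


def speed_up(time_one_node, time_n_nodes):
    """Conventional speed-up ratio (assumed definition)."""
    return time_one_node / time_n_nodes


def efficiency(time_one_node, time_n_nodes, n):
    """Conventional efficiency: speed-up per node (assumed definition)."""
    return speed_up(time_one_node, time_n_nodes) / n
```

For instance, a hypothetical job taking 8.0 time units on one node and 4.0 on four nodes would have a speed-up of 2.0 and an efficiency of 0.5.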
3.4.3 Pipelined implementation of three processes
In this part of the experiment, the three processes, namely
image thresholding, image smoothing and image blurring [Lindley91],
are executed consecutively in three levels of the generalized
topology in the order above. Three nodes are used for each level.
That means the topology used is a 3 x 3 mesh network, which is shown
in Fig. 6. The three levels of nodes form a pipelined network of
depth three. So both data decomposition and process pipelining are
included in the network.
3.5 The types of pipelining
There are two different types of pipelining possible for
the 3 x 3 mesh generalized topology. The first type is the
pipelining of the whole image. This type of pipelining achieves
parallel processing across levels only if more than one image is
channeled into the network for processing. Pipelining of images is
suitable in many applications where large numbers of images need to
go through the same transformation processes in the same order. An
example is continuous quality inspection of manufactured goods by
computer vision [Karkanis89]. But in applications where only a
single image needs processing, there will be no parallel processing
between levels. Fig. 7 shows the data and operation flow of
pipelining of images in the 3 x 3 mesh network.
Fig. 4 Transputer chain network used for the
speed-up and time efficiency experiments.
Fig. 5 The data and operation flow of the three
nodes chain network. (Node 1 reads the image from file and transmits
the initial image to nodes 2 and 3; each node performs its
computation; the nodes exchange the transformed images; node 1
writes the image to file.)
Fig. 6 A 3 x 3 mesh network using
process-and-data-decomposition approach (root transputer with PC
link).
Fig. 7 Data and operations flow of pipelining of images
(Level I, Level II, Level III).
First, the first image is channeled to the first node in
the first level of the network. The first node communicates the
image to each node in the first level. Then computation on one
third of the image begins in each of the three nodes. After that the
transformed portions are communicated among the three nodes in the
first level. Then the three nodes communicate the whole image
to the corresponding nodes in the second level. Nodes in the second
level receive the image and begin computation on one third of the
image each. Then communication of the transformed portions among
the three nodes in the second level starts. After that, they
transmit the whole image to the corresponding nodes in the third
level. The same operations as in the second level are repeated
there. The final image is channeled back to the root for I/O. All
communications and computation are done pixel by pixel, so the image
is not partitioned into blocks.
The second type of pipelining is called segment pipelining.
The image is divided into segments; for example, each segment
contains 3 columns of pixels. In this case, once the nodes in the
first level finish transforming a segment of the image, they
channel the transformed segment to the corresponding nodes in
the second level for further processing. The first level
nodes then continue transforming the second segment. The second
level nodes therefore begin processing almost as soon as the first level
nodes begin processing the first segment. In the same way, the
nodes in the second level channel segments to the
corresponding nodes in the third level as soon as they finish
processing a segment. The computation stops at each node when
it has finished processing 1/3 of the whole image in segments. Then
communication among the nodes in each of the three levels occurs.
The nodes in the first level send the other two thirds of the
image to the corresponding nodes in the second level, because those
nodes may need them to finish computing their last segments, which
need pixel values in the border sections. Nodes in the second level
do the same for the third level. The final image is channeled
back to the root from the first node in the third level for I/O. When
the three processes are pipelined in this way, they
are executed simultaneously and therefore the three execution
times overlap with each other. Even if only a single image
needs to be processed, parallel processing occurs and the
processing time is reduced by pipelining, unlike pipelining of
images. In addition, if a large number of different images need
processing, they will also benefit from pipelining of segments
because of parallel processing.
Fig. 8 shows the data and operations flow of the pipelining
of segments.
Fig. 9 (a) shows the execution times of the pipelining of
images. Four images are used as illustration. The total execution
time for processing four images is equal to ( T1 + T2 + 4*T3 ), where
T1, T2, T3 are the execution times for processes one, two and three
respectively when they are executed individually on an image ( T1 <
T2 < T3 ). Fig. 9 (b) shows the execution times of the pipelining of
segments. The total execution time for four images is equal to 4 *
T3. The four different images are channeled to the 3 x 3 mesh
network for processing. As shown, the pipelining of segments gives
a slightly shorter total execution time because the
execution times of the three processes overlap.
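The two totals can be checked with a small sketch. It assumes, as in Fig. 9, that T1 < T2 < T3 so the slowest process sets the pace once the pipeline is full. The function names and the sample times are illustrative only and are not taken from the experiments.

```python
def image_pipelining_time(t1, t2, t3, n_images):
    """Total time when whole images move level by level (Fig. 9a).

    Assumes t1 < t2 < t3: the first image needs t1 + t2 to reach the
    slowest level, which then takes t3 for each of the n_images images.
    """
    return t1 + t2 + n_images * t3


def segment_pipelining_time(t3, n_images):
    """Total time when segments are pipelined (Fig. 9b).

    The three processes overlap, so each image costs roughly the time
    of the slowest process alone.
    """
    return n_images * t3


# Hypothetical execution times in arbitrary units (not measured values).
T1, T2, T3 = 10, 20, 30
print(image_pipelining_time(T1, T2, T3, 4))  # 150
print(segment_pipelining_time(T3, 4))        # 120
```

The difference between the two totals is always T1 + T2, which is why the advantage of segment pipelining shrinks in relative terms as more images are pipelined.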
Here, we are interested in investigating the speed-up and
time efficiency of the 3 x 3 mesh network using decomposition of
data and pipelining of processes. The investigation includes both
pipelining of images and pipelining of segments. For illustration,
four different images are channeled to the network for processing.
[Figure: Nodes A, B, C in Level I, Nodes D, E, F in Level II and Nodes G,
H, I in Level III; each node receives a segment, computes on it (3 columns
of pixels) and transmits the transformed segment to the next level.]
Fig. 8 Data and operations flow of pipelining of segments.
[Figure: images I1 to I4 pass through processes 1, 2, 3 in turn along the
time axis, giving transformed images I1' to I4'.]
Fig. 9a Image pipelining. Process execution times
satisfy T3>T2>T1. T1 becomes larger and larger
because of idle times to wait for other processes
to finish.
[Figure: images I1 to I4 pass through processes 1, 2, 3 simultaneously
along the time axis, giving transformed images I1' to I4'.]
Fig. 9b Segment pipelining. Process execution times satisfy
T3>T2>T1, but T1, T2 are lengthened to T3 because
of idle times to wait for process three to finish.
And these execution times overlap with each other.
3.6 Linked-trees topology
In the generalized topology, high speed-up is accomplished,
but load balance is still a problem, which causes low time
efficiency of the generalized topology. A linked-trees topology is
recommended here in order to balance the load of the nodes in the
transputer network. Fig. 10 shows a typical linked-trees topology.
It is a better topology for the process-and-data-decomposition approach
[Arabnia90] for image processing. All nodes in the same tree
execute the same set of image processing instructions. Each tree is in
itself a farming topology [Antola91].
3.6.1 Data partitioning in linked-trees topology
To illustrate how image processing is done in a linked-trees
topology, consider a linked-tree with eight nodes. The root
receives image pixel values from the PC host. The whole image is
divided into eight blocks. The root channels the pixel values of
the blocks to the other nodes in the tree directly, or indirectly
through some of the nodes. So each node receives one of the eight
blocks of the image. Each node, including the root, processes one
eighth of the image. After computation, the values of the processed
pixels are channeled back to the root of the tree. The root then
sends the processed image to the root of the second tree for the
second image operation. After processing by the second tree, the root
of the second tree channels the processed pixels to the root
of the third tree, and so on. The root of the last tree
channels the final processed image back to the root of the whole
network, which is the root of the first tree, for output/storage.
Each tree executes one process, so the linked-trees topology is a
multiprocess topology.
There is no need to transmit the whole image to each node
of a tree. Only the portion of the image that needs to be processed by
that node is transmitted to it, together with the border section of
the adjacent portions of the image. This eliminates unnecessary
communication times.
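The partitioning can be sketched as follows. This is a minimal illustration that assumes the blocks are horizontal strips of rows and that the border section is a fixed number of rows taken from each adjacent block; the thesis does not state the block shape, and `partition_rows` is a hypothetical helper name.

```python
def partition_rows(height, n_nodes, border):
    """Split image rows 0..height-1 into n_nodes contiguous blocks.

    Returns a list of (own_start, own_end, send_start, send_end) tuples
    with exclusive end indices. Rows [own_start, own_end) are processed
    by the node; rows [send_start, send_end) are transmitted to it and
    include up to `border` extra rows from each adjacent block.
    """
    blocks = []
    base, extra = divmod(height, n_nodes)
    start = 0
    for node in range(n_nodes):
        # Spread any leftover rows over the first `extra` nodes.
        size = base + (1 if node < extra else 0)
        end = start + size
        send_start = max(0, start - border)
        send_end = min(height, end + border)
        blocks.append((start, end, send_start, send_end))
        start = end
    return blocks


# Eight nodes, a 64-row image, and a one-row border (assumed values).
for block in partition_rows(64, 8, 1):
    print(block)
```

With these assumed values each node is sent at most 10 rows instead of all 64, which is the saving in communication time that the text describes.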
Unlike the generalized topology of the process-and-data-decomposition
approach, there is no link between individual nodes
of different levels. So each tree can have a different number of
nodes from the other trees.
The rationale of such a linked-trees topology is that each
tree represents a different image operation/process. If one
process is more time consuming than the other processes, there
will be no waiting time for other trees or nodes to finish their
jobs. The time consuming process will simply be given more nodes to
finish its job faster. In each topology, there is still
decomposition of data and decomposition/pipelining of processes
[Arabnia90]. So all portions of the decomposed image and all
decomposed/pipelined processes are executed simultaneously.
[Figure: root transputer linked to the PC and to three linked trees
labelled Tree level 1, Tree level 2 and Tree level 3.]
Fig. 10 Linked-trees topology ( all lines
represent links for both communications and
system services. And each tree can have a
different number of nodes so as to balance
load. )
3.6.2 Illustration by examples
To illustrate the effectiveness of the linked-trees
topology in load balancing, and its effects on the speed-up and
time efficiency of a transputer network, three tasks of different
computation intensiveness are used. The three tasks form a sequence
of three processes and are executed on a linked-trees topology. The
three tasks in sequence are: image smoothing, image thresholding,
and blurring ( 10 x 10 pixels area ) [Lindley91], [Duff86]. These
three levels of trees form a pipeline of processes. The number of
nodes in each individual tree is chosen such that each tree or
process has approximately the same execution time, so one
process in the pipeline does not have to wait for the other processes
to finish. In this illustration, the execution time of each tree is
timed and compared to see if loads in a linked-trees topology are
well balanced. Also the speed-up and time efficiency of the
linked-trees topology are investigated.
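One way to picture how the number of nodes per tree might be chosen is sketched below. It assumes a simple cost model that is not taken from the thesis: a tree with n nodes needs serial_time / n for computation plus a fixed communication time. The function name and all sample numbers are hypothetical.

```python
import math


def nodes_needed(serial_time, comm_time, target_time):
    """Smallest node count n with serial_time / n + comm_time <= target_time.

    Returns None when the target cannot be met because communication
    alone already takes at least target_time.
    """
    if target_time <= comm_time:
        return None
    return max(1, math.ceil(serial_time / (target_time - comm_time)))


# Hypothetical serial execution times for the three processes.
serial_times = {"smoothing": 60000, "thresholding": 20000, "blurring": 700000}
COMM_TIME = 10000    # assumed fixed communication time per tree
TARGET_TIME = 30000  # assumed common execution time for every tree

for name, t in serial_times.items():
    print(name, nodes_needed(t, COMM_TIME, TARGET_TIME))
```

Under this model the most computation intensive process receives by far the largest tree, which matches the rationale given for the linked-trees topology.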
3.7 Results
This section presents the results of the experiments
described in the previous section. The results are presented in
three parts: results for implementing the farming or SIMD topology;
results for pipelining of images and pipelining of segments;
results for implementing the linked-trees topology.
The results for experiments using networks of four or more
transputers are obtained by simulations. The simulations use the
execution times and communication times of three nodes chain or
tree networks.
3.7.1 A simple implementation of farming or SIMD topology
The topology used in this experiment is as shown in Fig. 4.
Results are shown in Fig. 11 (only the figures for image
thresholding are presented in table form) and Fig. 12. Figures in
the table and all calculations are in 64 µs per unit. Fig. 12 (a)
shows the percentages of total communication time to the total
execution time against the number of transputers used in the
network. Fig. 12 (b) shows the speed-up ratios against the number of
transputers used in the network. Fig. 12 (c) shows the time
efficiencies of the topology against the number of transputers used
in the network. Eight different image processes are tried
with the farming topology, namely, image thresholding, Laplace edge
enhancement, image smoothing, blurring ( 5 x 5 pixels area
operation ), blurring ( 10 x 10 pixels area operation ), blurring ( 20
x 20 pixels area operation ), image mirroring, and image enlargement
[Lindley91], [Duff86]. Their computation intensiveness is linearly
related to their size of operation.
From the table in Fig. 11, image thresholding has speed-up
ratios < 1 for the farming topology using 2 or more transputers (
speed-up ratios range from 0.7 to 0.9 for two to twenty transputers
used in the topology ). Using parallel computation for image
thresholding costs more in total execution time compared with
serial computation. Image mirroring and image enlargement also show
speed-up ratios of less than one. Therefore, this farming topology
may not be appropriate for these three processes and other
processes with similar computation intensiveness in the transputer
network. The time efficiencies for these three processes are not
plotted, because they have no meaning if speed-up is less than 1.
From the graphs in Fig. 12 (b) and (c), it can be seen that the more
computation intensive processes have better speed-up ratios and
time efficiencies than the less computation intensive processes. It
can also be observed that as the number of transputers increases,
the total communication time becomes a dominant portion of the total
execution time, the growth of the speed-up ratio slows down and
efficiency decreases. The computation intensive processes are less
affected: they have a milder slow down in speed-up and a milder
decrease in efficiency. Actually, the computation intensive processes
also slow down as the number of transputers used in the topology
increases beyond 20 transputers. So for all image processes, the
speed-up curves consist of a linear portion and a slow down
portion. The dotted lines in Fig. 12 (b) are the slow down portions
for some of the computation intensive processes. The slow down
portions for processes B, C, and D are shown in the solid lines.
The communication time becomes a dominant portion of the total
execution time, but this is not due to an increase in the total
communication time. The computation time in each transputer reduces as
more transputers are used in the network, and therefore the
communication time becomes a dominant factor.
The fact that the total communication time becomes a
dominant portion of the total execution time as the number of
transputers in the network increases is shown by the percentage of
communication graph, Fig. 12 (a). But again, the total communication
time is less dominant for the more computation intensive processes
compared with the less computation intensive processes.
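The effect can be reproduced from the image thresholding figures reported in Fig. 11. The sketch below uses the rows for 3, 10 and 20 transputers and the single-transputer time of 17100 units; the function and variable names are illustrative, and the split of each total into communication and computation simply subtracts the two reported communication parts.

```python
SERIAL_TIME = 17100  # one transputer, in units of 64 microseconds

# (transputers, total execution time, comm. part 1, comm. part 2)
ROWS = [
    (3, 23900, 9000, 9100),
    (10, 19900, 9100, 9000),
    (20, 19100, 9200, 9000),
]


def summarize(total, tc1, tc2, serial=SERIAL_TIME):
    """Return speed-up, communication percentage and computation time."""
    comm = tc1 + tc2
    return {
        "speed_up": serial / total,
        "comm_percent": 100 * comm / total,
        "computation": total - comm,
    }


for n, total, tc1, tc2 in ROWS:
    r = summarize(total, tc1, tc2)
    print(n, round(r["speed_up"], 2), round(r["comm_percent"]), r["computation"])
```

The communication time stays near 18100 units in every row while the computation time falls from 5800 to 900 units, so the communication share rises even though communication itself does not grow.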
It is up to the users of the farming topology [Antola91] to
decide, based on the graphs in Fig. 12, whether an image process can be
parallelized cost effectively. But point
operations are certainly not suitable for parallel processing using
transputers because their speed-up is less than 1. The 1-bit array
processors may be the best alternative.
Results of applying SIMD topology in image thresholding
(all times in units of 64 µs)

No. of       Total      Communication  Communication  Speed-up  Time        Percentage of total
transputers  execution  time part 1    time part 2    ratio     efficiency  communication time to
             time                                                           total execution time
1            17100
2            24000      7900           7600           < 1       N/A         64%
3            23900      9000           9100           < 1       N/A         76%
10           19900      9100           9000           < 1       N/A         91%
20           19100      9200           9000           < 1       N/A         95%

Fig. 11 Experiment results by executing various
image tasks on a farming topology.
[Figure: three graphs plotted against the number of transputers in the
network. (a) Percentage of total communication time to total execution
time. (b) Speed-up ratio. (c) Time efficiency of the transputer network (%).
Curves: (A) Image thresholding, (B) Image smoothing, (C) Laplace edge
enhancement, (D) Blurring (5 x 5 pixels area), (E) Blurring (10 x 10 pixels
area), (F) Blurring (20 x 20 pixels area), (G) Image mirroring, (H) Image
enlargement.]
Fig. 12 The graphs for experiment results.
Quantities tabulated in Fig. 13 (N is the number of images pipelined,
n the number of nodes in the network):
T'   = T1 + T2 + T3*N    total execution time for pipelining of N images through the network
T''  = N*T3              total execution time for pipelining of segments through the network
T''' = (t1 + t2 + t3)*N  total execution time if the N images are processed by a serial computer
S'   = T'''/T'           speed-up, pipelining of images
E'   = S'/n              efficiency, pipelining of images
S''  = T'''/T''          speed-up, pipelining of segments
E''  = S''/n             efficiency, pipelining of segments

      T'        T''       T'''      S'     E'            S''   E''
(a)   1241500   1162000   3405300   2.7    30%           3.0   33%
(b)   271200    184400    870800    3.21   34%           4.7   52%
(c)   547900    461000    2177000   4.0    [illegible]   4.7   52%

(a) Results for pipelining 4 images through
smoothing, Laplace edge enhancement and blurring (10
x 10 pixels area) using the generalized topology.
(b) Results for pipelining 4 images through
smoothing, smoothing (2nd round) and Laplace edge
enhancement using the generalized topology.
(c) Results for pipelining 10 images through
smoothing, smoothing (2nd round) and Laplace edge
enhancement using the generalized topology.
Fig. 13 Results and calculations for pipelining
of images and pipelining of segments. T3>T2>T1,
where T1, T2, T3 are the execution times for each
level when they are processed individually. t1,
t2, t3 are the execution times for each process
using serial computer.
3.7.2 Pipelining of images and pipelining of segments
The schemes shown in Fig. 9 (a) and Fig. 9 (b) are used for
calculation of the total execution times. Four images are pipelined
through image smoothing, blurring ( 10 x 10 pixels area ) and
Laplace edge enhancement [Lindley91], [Duff86] in order. The
calculations and results are shown in Fig. 13 (a).
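The calculation behind Fig. 13 (a) can be sketched as below, using the total execution times reported there and the nine nodes of the 3 x 3 mesh. Speed-up is taken as serial time divided by parallel time and efficiency as speed-up divided by the number of nodes; the function name is illustrative.

```python
def speed_up_and_efficiency(parallel_time, serial_time, n_nodes):
    """Return (speed-up, efficiency) for one pipelining scheme."""
    speed_up = serial_time / parallel_time
    return speed_up, speed_up / n_nodes


N_NODES = 9             # 3 x 3 mesh network
SERIAL_TOTAL = 3405300  # T''' from Fig. 13 (a)

s_img, e_img = speed_up_and_efficiency(1241500, SERIAL_TOTAL, N_NODES)
s_seg, e_seg = speed_up_and_efficiency(1162000, SERIAL_TOTAL, N_NODES)

print(f"images:   speed-up {s_img:.2f}, efficiency {e_img:.1%}")
print(f"segments: speed-up {s_seg:.2f}, efficiency {e_seg:.1%}")
```

Both schemes leave roughly two thirds of the processor time unused, which is the low efficiency discussed next.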
In a load unbalanced parallel network, communication times
and idle times cannot be avoided. Both contribute to low efficiency
of the parallel network. In an ideal network, every second of the
processors' time should be used on the productive work of computation,
not on idling and communications. An ideal parallel transputer
network should have 100% efficiency.
If the three processes have computation times close to each
other, in other words, if the load in the network is well balanced, the
idle time will be less. The efficiencies will be higher, because the
unproductive idling time of the processors is less significant. Load is
the execution time required for each node or level in the network.
This explains why load balancing of the generalized
topology affects the overall speed-up ratio and efficiency of
the network.
If, for the sake of balancing, a part of the network has
very low efficiency, the gain in efficiency due to load balance
will be offset by this low efficiency. This is illustrated with the
linked-trees topology later. So it is not always true that a well
balanced network will increase overall efficiency.
In another experiment, the three pipelined processes we used are
smoothing, smoothing (2nd round) and Laplace edge enhancement
[Lindley91]. These three processes have close computation times.
The calculations and results are shown in Fig. 13 (b).
The efficiency of this better load balanced network is
better than the efficiency of the unbalanced network implementing
smoothing, blurring and Laplace edge enhancement because of less
idle time.
We also conducted experiments with 10 images going through
the pipeline. The calculations and results are shown in Fig. 13
(c). Fig. 13 (b) and Fig. 13 (c) show that the number of images
pipelined affects the overall efficiency for pipelining of
images. But the overall efficiency is still limited by the
percentage of communication time and the amount of idle time.
3.7.3 Linked-trees topology experiments
Three processes, namely, blurring ( 10 x 10 pixels area ),
smoothing and Laplace edge enhancement [Lindley91], are executed in
order on 4 images using the linked-trees topology. The topology
consists of three trees linked together for pipelining. The three
trees are: a tree of 50 nodes for blurring, a tree of 3 nodes for
smoothing, and a tree of 3 nodes for Laplace edge enhancement. The
calculations and results for individual trees are shown in Fig. 15.
The calculations and results for the whole linked-trees are shown
in Fig. 14 (a). The results show low efficiency.
One may ask why the efficiency for a balanced network is
low. The efficiency of the overall network does not only depend on
the amount of idle time ( which we have successfully reduced by
load balancing ), but it also depends on the efficiency of
individual trees. In order to have execution times of the three
trees close to each other, the tree with 50 nodes for blurring ( 10
x 10 ) is forced to have poor efficiency. So it affects the overall
efficiency of the network.
If we allow some imbalance in the linked-trees topology by
using only a 10 nodes tree for the blurring job, using data for a
10 nodes tree from Fig. 15 (d), the calculations and results are as
shown in Fig. 14 (b).
By increasing the efficiency of the tree for blurring by
using fewer nodes, the overall efficiency has increased. The
speed-up is half of the speed-up when 50 nodes are used. This network
again has better efficiency when compared to the generalized
topology accomplishing a speed-up of 7.6, which required a network of
30 nodes instead of 16 nodes.
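The trade-off between the two linked-trees configurations can be made concrete with the totals reported in Fig. 14. The sketch below recomputes speed-up and efficiency for the 56-node and 16-node networks and then the speed-up gained per extra node; the dictionary keys and the helper name are illustrative.

```python
SERIAL_TOTAL = 3405380  # T''' from Fig. 14, 4 images on a serial computer

# configuration -> (total execution time T', total number of nodes)
CONFIGS = {
    "50-node blurring tree": (216580, 56),
    "10-node blurring tree": (447639, 16),
}


def evaluate(parallel_time, n_nodes, serial=SERIAL_TOTAL):
    """Return (speed-up, efficiency) of the whole linked-trees network."""
    speed_up = serial / parallel_time
    return speed_up, speed_up / n_nodes


results = {name: evaluate(*cfg) for name, cfg in CONFIGS.items()}
for name, (s, e) in results.items():
    print(f"{name}: speed-up {s:.1f}, efficiency {e:.0%}")

# Speed-up bought by each of the 40 additional blurring nodes.
extra_nodes = 56 - 16
gain = results["50-node blurring tree"][0] - results["10-node blurring tree"][0]
print(f"speed-up per extra node: {gain / extra_nodes:.2f}")
```

Each additional node contributes only about 0.2 to the speed-up, which is why the smaller, slightly unbalanced network has the higher efficiency.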
We summarize the results of Fig. 14 (a), Fig. 14 (b)
and other simulations in two graphs, Fig. 16 (a) and Fig. 16
(b). A comparison of network efficiencies and number of
transputers used for the generalized topology and the linked-trees
topology accomplishing the same speed-up ratio is given in Fig. 17.
One can observe that the linked-trees topology is superior for the
process-and-data-decomposition approach: it uses fewer nodes and
has higher network efficiency in accomplishing the same speed-up
when compared to the generalized topology. The disadvantage of
the linked-trees topology is that it is not capable of pipelining of
segments.
One can observe that by increasing the number of nodes in
the tree for blurring by 40, the speed-up is only doubled. It may
not be cost effective. So it depends on the user of the topology,
who needs to decide whether the speed-up or the efficiency is more
important. Then he/she can choose between a perfectly balanced
linked-tree and a slightly unbalanced one.
Among the linked-trees topology, the generalized topology
and the farming chain topology, the linked-trees topology gives better
efficiency than the generalized topology while accomplishing the
same speed-up ratio. That means users can use fewer nodes
for the linked-trees topology while accomplishing the same speed-up
ratio as the generalized topology. Communication times among nodes
in the linked-trees topology are less than those of the generalized
topology, because only the pixel values that need to be processed
by each individual node are transmitted. The farming chain
topology is a special case of the generalized topology; it is
similar in speed-up and efficiency to a linked-trees topology which
consists of one tree.

Quantities tabulated in Fig. 14 (n is the number of nodes in the network):
T'   = T1 + T2 + T3*4    total execution time for pipelining of 4 images through the network
T''' = (t1 + t2 + t3)*4  total execution time if the 4 images are processed by a serial computer
S'   = T'''/T'           speed-up, pipelining of images
E'   = S'/n              efficiency, pipelining of images

      T'       T'''      S'     E'
(a)   216580   3405380   15.7   28%
(b)   447639   3405380   7.6    48%

(a) Results for pipelining 4 images through
three linked-trees (3 nodes tree for smoothing,
3 nodes tree for Laplace edge enhancement and 50
nodes tree for blurring (10 x 10 pixels area)).
(b) Results for pipelining 4 images through
three linked-trees (3 nodes tree for smoothing,
3 nodes tree for Laplace edge enhancement and
10 nodes tree for blurring (10 x 10 pixels
area)).
Fig. 14 Results and calculations for
pipelining through the linked-trees topology.
T3>T2>T1, where T1, T2, T3 are the execution
times for each level when they are processed
individually. t1, t2, t3 are the execution
times for each process using serial computer.

Quantities tabulated in Fig. 15 (t is the serial computer execution time):
T             total execution time
Tc1           communication time part 1 in the algorithm
Tc2           communication time part 2 in the algorithm
t/T           speed-up ratio of the network
t/T/nodes     time efficiency of the network (speed-up divided by no. of nodes)
(Tc1+Tc2)/T   percentage of total communication time to the total execution time

      T       Tc1    Tc2    t/T   t/T/nodes   (Tc1+Tc2)/T
(a)   3559?   5500   4900   1.9   63%         29%
(b)   37700   5200   4700   2.?   72%         26%
(c)   30200   8000   8000   23    46%         53%
(d)   93578   8000   8000   7.4   75%         17%

(a) Results for 3 nodes tree for smoothing when
processed individually.
(b) Results for 3 nodes tree for Laplace edge
enhancement when processed individually.
(c) Results for 50 nodes tree for blurring (10
x 10 pixels area) when processed individually.
(d) Results for 10 nodes tree for blurring (10 x
10 pixels area) when processed individually.
Fig. 15 The results of the three linked-trees
when they are processed individually.

[Figure: efficiency of the whole linked-trees (%) plotted against the
number of nodes (5, 10, 20, 50) in the linked-tree for blurring.]
Fig. 16(a) The effect of number of nodes in
the linked-tree for blurring on the efficiency
of the whole linked-trees topology.
[Figure: speed-up of the whole linked-trees plotted against the number of
nodes (5, 10, 20, 50) in the linked-tree for blurring.]
Fig. 16(b) Effect of the number of nodes in
the linked-tree for blurring on the speed-up
ratio of the whole linked-trees topology.

Total number of nodes  Total number of nodes  Speed-up   Network efficiency  Network efficiency
in the generalized     in the linked-trees    for both   for the generalized for the linked-trees
topology               topology               topologies topology            topology
150                    56                     15.7       12%                 28%
60                     26                     11.6       19%                 4?%
30                     16                     7.6        25%                 48%
15                     11                     4.5        20%                 41%

Fig. 17 Comparison of network efficiencies and
number of transputers used for generalized
topology and linked-trees topology
accomplishing the same speed-up ratio.
CHAPTER IV
CONCLUSION
In this thesis, a generalized topology for image processing
is proposed and its effectiveness illustrated. The effect of
process computation intensiveness, percentage of communication
time, load balance of the network, and the number of nodes used in
the network on the efficiency and speed-up is illustrated. In order
to achieve better resource utilization, a linked-trees topology is
proposed. The effectiveness of the linked-trees topology in load
balancing is illustrated. Load balancing is intended to improve
the efficiency of the network. The fact that a perfectly load balanced
linked-trees network may in some cases have low efficiency is
illustrated. This happens due to other factors, such as the individual
efficiencies of trees, that affect the overall efficiency.
The nature of point operations and small area operations
makes them very suitable for 1-bit VLSI processor arrays.
A transputer network for image processing, despite its great
processing power, is not competitive or suitable for these jobs.
But for high level image processing jobs and large area operations,
where the 1-bit processor array is not capable or not efficient,
transputer networks have great potential for development. More image
processing jobs with more complex algorithms need to be tried out
to further illustrate the usefulness of the generalized topology
and the linked-trees topology.
There is no significant difference in processing time
between pipelining of images and pipelining of segments if the
number of images pipelined is large. The linked-trees topology is
good only for pipelining of images. If an algorithm requires
pipelining of segments and, at the same time, load balance is
desired to increase network efficiency, the linked-trees topology is
not a good choice. Exploration of topologies suitable for the above
mentioned task is suggested as future work.
Bibliography, 
[Arabnia90] 
Arabnia. H.R., ., A parallel algorithm for the arbitrary rotation 
of digitized images using process-and-data-decomposition 
approach", Journal of parallel and distributed computing, No. 10, 
pp. 188-192, 1990. 
[Antola91] 
Antola, A., Tellarini, M., ,. Definition and evaluation of a 
transputer-based architecture for image compression and 
reconstruction", Microprocessing and Microprogramming, Uol.31, pp. 
127-132, 1991. 
[Ashford92] 
Ashford, R.W., Connard, P. , Daniel, R., " Experiments in solving 
mixed integer programming problems on a small array of 
transputers", Journal of operation Research Society, Uol. 43, Ho. 
5, pp. 519-531. 1992. 
[Crookes89] 
Crookes. D., MorrolJJ, P. J., Sharif, B., McClatchey, [.. " An 
environment for deueloping concurrent software for transputer-based 
image processing", Microprocessing and Microprogramming. Uol. 27, 
pp. 417-422, 1989. 
[CSA9 0a] 
Logical Systems C for the transputer: Uersion 89.1 User Manual. 
Computer System Architects press, 1990. 
[CSA9 0b] 
Transputer'Education Kits User Guild. Computer Sys tem Architects 
Press, 1990. 
[CSA90c] 
Transputer Architecture and Overview, Computer System Architects 
Press, 1990. 
[Dasgupta89] 
Dasgupta, S., Computer Architecture, a nodern synthesis Uol. 2, 
John Wiley & Sons. 1989. 
[Duff86] 
Duff, M.J.B., Intermediate-Level Image Processing, Academic Press, 
1986. 
[EET95 ] 
"Transputer is the core of control", Electronic Engineering Times, 
t1at'ch 20, 1995. 
[Fairhurst88] 
Fairhurst, M.C., Computer Vision for Robotic Systems, Prentice 
Hall, 1988. 
[Gupta93] 
Gupta, A., Kumar, V., "Performance properties of large scale 
parallel systems", Journal of Parallel and Distributed Computing, 
19(3), November 1993. 
[Gray91] 
Gray, J.P., Pocle, F., "Object-oriented approach for transputer-based 
database system", Information and Software Technology, 
Vol. 33, No. 1, February 1991. 
[Hersch93] 
Hersch, Roger D., "Distributing Pixel Images Among Parallel Disk 
Arrays", Microprocessing & Microprogramming, pp. 33-36, Jan 1993. 
[Hull94] 
Hull, M.E.C., Crookes, D., Sweeney, P.J., Parallel Processing: The 
Transputer and its Applications, Addison-Wesley, 1994. 
[Im90] 
"Parallel world of a new superpower", International Management, 
pp. 58-60, November 1990. 
[Karkanis89] 
Karkanis, S., Metaxaki-Kossionides, C., Dimitriadis, B., "A 
machine-vision quality inspection system for texture industries 
supported by parallel multitransputer architecture", 
Microprocessing and Microprogramming, Vol. 28, pp. 247-252, 1989. 
[Kerridge94] 
Kerridge, J., "Dynamic allocation of processes and channels in 
T9000/C104 networks using OCCAM 3", Progress in Transputer and 
OCCAM Research, IOS Press, pp. 1-17, April 1994. 
[Kirland91] 
Kirland, C.Y., "Transputers: controllers for the 1990's", Plastics 
World, pp. 62-64, November 1991. 
[Lakshmi90] 
Lakshmivarahan, S., Dhall, S.K., Analysis and Design of Parallel 
Algorithms: Arithmetic and Matrix Problems, McGraw-Hill, 1990. 
[Leung90] 
Leung, C.H.C., Ghogorou, H.T., Mannock, K.L., "High Performance 
Relational Database Systems on Transputers", Application of 
Transputers, vol. 1, pp. 430-436, 1990. 
[Lindley91] 
Lindley, C.A., Practical Image Processing in C, John Wiley & Sons, 
1991. 
[Morrow91] 
Morrow, P.J., Crookes, D., "Parallelising an image segmentation 
and analysis system for infra-red images", Applications of 
Transputers 3, vol. 2, pp. 327-332, 1991. 
[Mckeever92] 
Mckeever, J.D.M., Holton, D.R.W., Mckeag, R.M., "Using transputers 
in a robotic programming and control system", Microprocessing and 
Microprogramming, Vol. 34, pp. 117-120, 1992. 
[Mohan90] 
Mohan Kumar, J., Patnaik, L.M., Prasad, D.K., "A transputer-based 
extended hypercube", Microprocessing and Microprogramming, Vol. 
29, pp. 225-236, 1990. 
[Pachowicz89] 
Pachowicz, P.M., "Image processing by software parallel 
computation", Image and Vision Computing, Vol. 7, No. 2, 1989. 
[PhilipsD94] 
Philips, D., Image Processing in C, R&D Publications Inc., 1994. 
[Phillips94] 
Phillips, I., Parish, D., "On the use of transputers in Multimedia 
teleconferencing system", Progress in Transputer and OCCAM 
Research, IOS Press, pp. 148-154, April 1994. 
[Russ92] 
Russ, J., The Image Processing Handbook, CRC Press, 1992. 
[Schomberg89] 
Schomberg, H., "Image Processing on a Transputer-based Perfect 
Shuffle Machine", Microprocessing & Microprogramming, pp. 277-280, 
Jan 1989. 
[Stalling90] 
Stalling, W., Computer Organization and Architecture, MacMillan, 
1990. 
[Stalker91] 
Stalker, M.D., "Simplifying parallel programming on the 
transputer network", Computer Technology Review, pp. 15-22, Summer 
1991. 
[Stallard93] 
Stallard, P.W.A., Duno, R.W., Daniels, A.R., "Dynamic real-time 
scheduling for a parallel production system on an enhanced 
transputer array", Transputer and OCCAM Research: New Directions, 
IOS Press, pp. 218-231, March 1993. 
[Tomohisu93] 
Tomohisu, K., Motok, O., Yoshio, S., Atuo, ~., "Parallel image 
processing for defect inspection by layered structure of 
transputers", Transputer/OCCAM Japan 5, pp. 161-170, June 1993. 
[Uhr87] 
Uhr, L., Parallel Computer Vision, Academic Press, 1987. 
[Zomaya92] 
Zomaya, A.Y., "On the fast simulation of direct and inverse 
Jacobians for robotic manipulators", Robotics and Autonomous 
Systems, Vol. 10, pp. 43-61, 1992. 
VITA 
Kit Sai Wong 
Candidate for the Degree of 
Master of Science 
Thesis: PROCESSOR TOPOLOGIES FOR IMAGE PROCESSING 
APPLICATIONS 
Major Field: Computer Science 
Biographical: 
Personal Data: Born in Hong Kong, on June 24, 1959, the son of 
Wong Man Sun and Pong Yip Moi. 
Education: Graduated from Pentecostal School, Hong Kong in 
1977; received the Bachelor of Engineering degree in 
Mechanical Engineering from National University of 
Singapore, Singapore in 1985; completed the requirements 
for Master of Science degree at Oklahoma State 
University in May, 1997. 
Professional Experience: Engineer, Seagate Technology Ltd., 
Singapore, March 1986 to March 1987; Engineer, 
Micropolis Ltd., Singapore, March 1987 to October 1989; 
Engineer, Hong Kong Polytechnic, Hong Kong, October 
1989 to July 1991. 
