Performance Evaluation of Linearly Extensible Multiprocessor Architectures for Networking by Samad, Abdus
PERFORMANCE EVALUATION OF LINEARLY 
EXTENSIBLE MULTIPROCESSOR 
ARCHITECTURES FOR NETWORKING 
ABSTRACT 
OF THE 
THESIS 
SUBMITTED FOR THE AWARD OF THE DEGREE OF 
Doctor of Philosophy 
IN 
COMPUTER ENGINEERING 
BY 
ABDUS SAMAD 
Under the Supervision of 
PROF. M. QASIM RAFIQ DR. OMAR FAROOQ 
Supervisor Co-supervisor 
DEPARTMENT OF COMPUTER ENGINEERING 
FACULTY OF ENGINEERING & TECHNOLOGY 
ALIGARH MUSLIM UNIVERSITY 
ALIGARH (INDIA) 
2009 
Abstract 
The explosive growth of the Internet has led to a significant increase in 
demand for fast and effective retrieval of information. Web services take 
different approaches to enhancing their performance. For distributed file 
transfers, parallel downloading is one of the effective schemes for reducing the 
access (download) time. In parallel downloading, a single file is transferred 
over multiple connections to different file servers. However, the 
effectiveness of parallel downloading also depends on the physical 
network topology: if some connections share the same bottleneck link, 
performance does not improve and the additional connections simply waste 
server resources. To reduce these bottlenecks, enhancing server 
performance has become a critical issue in coping with the increasing use of 
Internet-based services. The approaches to improving server performance are 
either software based or hardware based. In the software approach, 
performance is improved through effective algorithms that minimize the 
access time of user requests. In the hardware approach, additional computing 
power is obtained by combining various techniques and by adding more 
processors to a single system. 
Continuous technological development and the use of nanotechnology 
have turned computers into complex, high-performance systems. However, the 
computing power of a single-node server is limited and still cannot satisfy the 
requirements of many Web applications. On the other hand, with the 
advancement of VLSI, the cost of installing and maintaining a k-processor 
system is significantly less than that of k separate single-processor systems. 
This gives more scope for designing a high-performance server by exploiting 
parallelism within a k-processor machine more efficiently. In terms 
of hardware, this typically means providing multiple simultaneously active 
processors (nodes). In terms of software, it means structuring a program as a 
set of largely independent subtasks. 
Research is active in developing new multiprocessor 
architectures and in scheduling partitioned programs onto them in order to achieve 
higher computing power and scalable parallelism. One of the important issues 
in the design of massively parallel systems is the choice of interconnection 
topology. For this reason, a plethora of interconnection networks have appeared 
in the literature. The most popular of these include the hypercube, de Bruijn, 
mesh and tree networks. The most widely accepted commercial network topology 
is the binary n-cube, also known as the hypercube network. The hypercube 
topology has been used in numerous parallel systems such as the Cosmic Cube, 
Ametek S/14, iPSC and the Ncube. The attractiveness of the hypercube 
topology lies in its small diameter and fault tolerance. The diameter of a network is 
the largest distance a message has to travel between any two nodes to reach its 
final destination. A low diameter is better, because the diameter gives the 
maximum number of hops between source and destination nodes and is therefore 
an important measure of communication cost. The hypercube has a 
logarithmic diameter. The major drawback of the hypercube network is that it 
is not scalable, owing to its exponential expansion, and that its VLSI layout is 
difficult. The de Bruijn network is another topology with a constant node 
degree; however, like the hypercube, it suffers from exponential expansion and 
is hence more complex. A new topology named the Linearly Extensible Tree (LET) 
has been reported which has fewer nodes and a smaller diameter. The 
complexity of extension in LET increases linearly, and each nth extension 
requires adding a single layer of (n+1) nodes. 
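The logarithmic diameter of the hypercube can be checked numerically. The following is a minimal sketch, offered as an illustration and not as part of the simulations reported in this thesis. It builds the binary n-cube by linking node labels that differ in exactly one bit and measures the diameter by breadth-first search. The function names hypercube_adjacency, eccentricity and diameter are hypothetical helpers introduced only for this example.

```python
from collections import deque


def hypercube_adjacency(n):
    """Adjacency lists of the binary n-cube: neighbours differ in exactly one bit."""
    return {v: [v ^ (1 << b) for b in range(n)] for v in range(2 ** n)}


def eccentricity(adj, source):
    """Largest shortest-path distance (in hops) from source, found by BFS.

    Assumes the network is connected.
    """
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return max(dist.values())


def diameter(adj):
    """Diameter = maximum eccentricity over all nodes."""
    return max(eccentricity(adj, v) for v in adj)


if __name__ == "__main__":
    for n in range(1, 5):
        print(2 ** n, "nodes -> diameter", diameter(hypercube_adjacency(n)))
```

For a hypercube with 2^n nodes the computed diameter equals n, that is, the base-2 logarithm of the node count. Each additional hop of diameter, however, is bought by doubling the number of nodes, which is the exponential expansion noted above.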
To achieve high performance and effective sharing of computing 
resources, it is important to distribute the load (tasks) evenly among the 
nodes. Therefore, efficient scheduling schemes are required to map the load 
onto the set of nodes. The main problem faced in the design of a scheduling 
algorithm is the lack of information about the network load distribution and 
hence about task execution times. In such situations a dynamic 
algorithm is required, at the cost of extra overhead to 
determine the state of the tasks and of the network itself. Dynamic scheduling 
works on the fly, that is, all information is obtained while scheduling is in 
progress. Several researchers working in the area of parallel processing have 
reported different multiprocessor architectures and implemented various scheduling 
schemes on them for better performance. The research in 
this direction indicates that the number of processors in a multiprocessor 
architecture is being reduced, thereby reducing the cost and complexity of the 
architecture without loss of performance. 
The present work, reported in this thesis, is concerned with the design 
and development of a new multiprocessor architecture, called the Linearly 
Extensible Cube (LEC) network, which is proposed to work as a server. A new 
dynamic scheduling scheme named the Two Round Scheduling (TRS) scheme is 
devised to utilize all the nodes of the proposed server effectively. The 
performance of the proposed server is evaluated for balancing different types of 
load among the nodes of the server using the TRS scheme. In addition, 
a second algorithm is proposed and implemented on the LEC server to manage a 
number of retrieval queries. 
The server model proposed is a Linearly Extensible Cube (LEC) 
multiprocessor network, which exhibits the desirable properties of similar types 
of multiprocessor networks. The LEC network offers linear extensibility 
with only two nodes per extension. It maintains a constant node degree 
regardless of the increase in network size (i.e., the number of nodes in the system). 
The network has a lower diameter, which reduces the average path length 
traveled by messages, lower complexity, and a high bisection width. A 
comparative analysis is carried out which shows the superiority of the proposed 
server in terms of topological properties. 
The TRS dynamic scheduling scheme is proposed to 
schedule different types of load on the proposed LEC network. The TRS 
scheme uses the adjacency matrix to check the level of connectivity. The 
scheme uses three steps. In the first step, the load is mapped onto the processors, 
followed by the identification of donors and acceptors by calculating the value 
of the Ideal Load (IL). The IL is calculated by summing the load available on each 
node in the network at a particular load stage and dividing by the total number of 
nodes in the network. A node whose load is greater than the IL is a 
donor node, and a node whose load is less than the IL is an 
acceptor node. In the second step, tasks are diffused from donors to acceptors 
over the first level of connectivity of the network, with the help of the adjacency 
matrix. Finally, in the third step, processors that are not directly connected are 
considered for load migration. The performance is evaluated in 
terms of the Load Imbalance Factor (LIF), which indicates the load imbalance remaining 
after a balancing action at various stages of the load. 
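The three steps described above can be sketched in a few lines of code. This is a minimal illustration under stated assumptions, not the TRS algorithm as tabulated in the thesis. The function names are hypothetical. The rule for how many tasks move in one migration (whole tasks only, never pushing either node past the IL) is an assumption. The LIF formula used here, (maximum load - IL) / IL, is an assumed definition, since this abstract does not state one. The six-node ring used as the test network is a hypothetical stand-in and is not the LEC topology.

```python
def ideal_load(load):
    """IL: total load at this stage divided by the number of nodes."""
    return sum(load) / len(load)


def load_imbalance_factor(load):
    """Assumed LIF definition: (max load - IL) / IL; 0.0 for an empty system."""
    il = ideal_load(load)
    return 0.0 if il == 0 else (max(load) - il) / il


def migrate(load, donor, acceptor, il):
    """Move whole tasks from donor to acceptor without crossing IL on either side."""
    amount = int(min(load[donor] - il, il - load[acceptor]))
    if amount > 0:
        load[donor] -= amount
        load[acceptor] += amount
    return amount


def two_round_balance(load, adj):
    """Step 1: find donors/acceptors via IL. Step 2: migrate between direct
    neighbours. Step 3: migrate between nodes two hops apart."""
    load = list(load)
    n = len(load)
    il = ideal_load(load)
    # Round 1: donor and acceptor are directly connected (adj[d][a] == 1).
    for d in range(n):
        for a in range(n):
            if adj[d][a] and load[d] > il and load[a] < il:
                migrate(load, d, a, il)
    # Round 2: donor and acceptor share an intermediate neighbour m.
    for d in range(n):
        for a in range(n):
            if d != a and not adj[d][a] and load[d] > il and load[a] < il:
                if any(adj[d][m] and adj[m][a] for m in range(n)):
                    migrate(load, d, a, il)
    return load


if __name__ == "__main__":
    # Hypothetical 6-node ring, used only to exercise the sketch.
    ring = [[1 if abs(i - j) in (1, 5) else 0 for j in range(6)] for i in range(6)]
    before = [12, 0, 0, 6, 0, 0]
    after = two_round_balance(before, ring)
    print(before, load_imbalance_factor(before))
    print(after, load_imbalance_factor(after))
```

In this example the first round alone cannot balance node 4, because neither of its direct neighbours is still a donor once they have reached the IL; the second round reaches it from node 0 through node 5. This is the situation the third step of the scheme is meant to cover.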
The proposed scheduling scheme is also implemented on other similar 
multiprocessor networks, namely the hypercube, de Bruijn and Linearly Extensible 
Tree (LET) networks. The results obtained from the simulation studies show 
that LEC performs better with the proposed TRS scheme. To check the 
effectiveness of the proposed scheduling scheme, other standard scheduling 
schemes, namely the Minimum Distance Scheduling (MDS), Hierarchical 
Balancing Method (HBM) and Gradient Model (GM) schemes, are also 
implemented on the LEC network. The comparative results obtained show that 
the proposed TRS scheme quickly balances the load on the proposed LEC 
network. Therefore, the proposed model (LEC network with TRS scheme) is 
considered a better organizational model. 
To check the effectiveness of the LEC network when used as a server, it 
is tested both for handling unpredictable communication traffic and for 
servicing a number of queries (requests). The proposed TRS scheme is used to 
utilize all the nodes of the LEC network under uneven communication traffic by 
migrating the load in various block sizes, and the performance 
of the server is evaluated in terms of load balancing time at various load imbalances. 
Similarly, a second algorithm is proposed and implemented on the LEC server for 
information retrieval. This scheme organizes the information available on the 
server (database) in such a way that fast retrieval of information can be 
achieved. A table consisting of packet IDs and their addresses is maintained, 
which is accessible by all the nodes of the server. Through simulation, a 
number of search queries are processed by the proposed LEC server and the 
service/search time is evaluated. The results obtained indicate that as the 
number of queries increases, the search time remains approximately constant. 
The same scheme is also applied to a single-node server and a comparative study 
is made with the proposed LEC server. The results show that the 
multiprocessor server reduces the information retrieval time by a factor equal to the 
number of processors available in the network. 
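One way to picture the shared table of packet IDs and addresses is sketched below. This is an assumption-laden illustration, not the retrieval procedure of the thesis: the round-robin placement of packets, the (node, slot) form of an address and all function names are hypothetical. The two scan-counting functions only illustrate why spreading the data over k nodes can shorten a search by roughly a factor of k compared with a single node scanning everything.

```python
def load_packets(packet_ids, num_nodes):
    """Place packets round-robin on the nodes and build the shared
    table mapping packet ID -> (node, slot)."""
    stores = [[] for _ in range(num_nodes)]
    table = {}
    for i, pid in enumerate(packet_ids):
        node = i % num_nodes
        table[pid] = (node, len(stores[node]))
        stores[node].append(pid)
    return stores, table


def retrieve(pid, stores, table):
    """Any node can resolve a packet ID through the shared table."""
    node, slot = table[pid]
    return node, stores[node][slot]


def scan_steps_single(pid, packet_ids):
    """Comparisons needed by one node scanning the whole collection in order."""
    return packet_ids.index(pid) + 1


def scan_steps_parallel(pid, stores):
    """Comparisons needed when every node scans only its own partition at the
    same time; the search ends when the holding node reaches the packet."""
    for store in stores:
        if pid in store:
            return store.index(pid) + 1
    raise KeyError(pid)


if __name__ == "__main__":
    ids = list(range(100, 124))            # 24 hypothetical packet IDs
    stores, table = load_packets(ids, 6)   # 6 nodes, 4 packets each
    print(retrieve(123, stores, table))
    print(scan_steps_single(123, ids), scan_steps_parallel(123, stores))
```

With 24 packets on 6 nodes, the last packet costs 24 comparisons on a single node but 4 when the six partitions are scanned simultaneously, a ratio equal to the number of nodes. Real retrieval times would also include communication and table-access costs, which this sketch ignores.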
In the present work it is found that, at a lower cost, the proposed LEC server 
exhibits the desired characteristics of a multiprocessor system, such as small 
diameter, constant node degree, linear extensibility and a high bisection width. 
It performs better in terms of network utilization than other 
similar networks, particularly for uneven traffic. The proposed network, when 
used as a server, retrieves information efficiently and 
may be used in Web services. 
List of papers from PhD Thesis 
International Conference Papers 
[1]. [Samad et al., 2010] Samad, A., Rafiq, M. Q., and Farooq, O. (2010). 
LEC: An efficient scalable parallel interconnection network. Accepted 
for presentation in the International Conference on Emerging Trends in 
Computer Science, Communication and Information Technology 
(CSCIT'2010), to be held at Nanded, India, from 09-11 Jan., 2010. 
[2]. [Samad et al., 2009] Samad, A., Rafiq, M. Q., and Farooq, O. (2009). 
Effective Information Balancing on a Multiprocessor Server. In 
Proceedings of IEEE International Advanced Computing Conference 
(IACC'09), Patiala, India, pages 1215-1219. 
[3]. [Samad et al., 2008] Samad, A., Rafiq, M. Q., and Farooq, O. (2008). A 
Novel Algorithm for Fast Retrieval of Information from a 
Multiprocessor Server. In Proceedings of 7th WSEAS International 
Conference on Software Engineering, Parallel and Distributed Systems 
(SEPADS '08), University of Cambridge, UK, pages 68-73. 
[4]. [Samad and Rafiq, 2005] Samad, A. and Rafiq, M. Q. (2005). A Novel 
Server Architecture for networking. In Proceedings of International 
Conference on Robotics, Vision, Information and Signal Processing 
(ROVISP2005), University Sains, Malaysia, pages 1029-1032. 
PERFORMANCE EVALUATION OF LINEARLY 
EXTENSIBLE MULTIPROCESSOR 
ARCHITECTURES FOR NETWORKING 
THESIS 
SUBMITTED FOR THE AWARD OF THE DEGREE OF 
Doctor of Philosophy 
IN 
COMPUTER ENGINEERING 
BY 
ABDUS SAMAD 
Under the Supervision of 
PROF. M. QASIM RAFIQ DR. OMAR FAROOQ 
Supervisor Co-supervisor 
DEPARTMENT OF COMPUTER ENGINEERING 
FACULTY OF ENGINEERING & TECHNOLOGY 
ALIGARH MUSLIM UNIVERSITY 
ALIGARH (INDIA) 
2009 
CERTIFICATE 
This is to certify that the thesis entitled "Performance 
Evaluation of Linearly Extensible Multiprocessor 
Architectures for Networking", which is being submitted by 
Mr. Abdus Samad for the award of the degree of Doctor of 
Philosophy in Computer Engineering from the Faculty of 
Engineering and Technology, Aligarh Muslim University, 
Aligarh, India, is entirely based on the work carried out by him 
under our supervision and guidance. The work reported 
embodies the original work of the candidate and has not been 
submitted to any other University or Institution for the award of 
any degree or diploma, to the best of our knowledge. 
(Dr. M. Qasim Rafiq) 
Professor, 
Dept. of Computer Engg. 
Supervisor 
(Dr. Omar Farooq) 
Reader, 
Dept. of Electronics Engg. 
Co-supervisor 
Aligarh Muslim University, 
Aligarh, India 
December, 2009 
Acknowledgements 
I am very much indebted and wish to express my sincere thanks to 
my supervisors Prof. M. Qasim Rafiq, Professor and Chairman of 
Department of Computer Engineering and Dr. Omar Farooq, Reader, 
Department of Electronics Engineering, A.M.U., Aligarh. No amount of 
words would be sufficient to communicate my deep sense of gratitude to 
my guide Prof. M. Qasim Rafiq, who motivated me to work in the area of 
Parallel Processing right from the beginning of my post graduate studies. 
He has been a continuous source of encouragement and has given his 
erudite guidance and advice with constant keen interest at every stage of 
this work. I am grateful to him for sparing his valuable time in spite of his 
busy schedule in the Department and for all the kindness he has shown 
towards me throughout the period of my research work. 
I extend my gratitude to Dr. Omar Farooq, my co-supervisor, with 
all my heart for his invaluable guidance, constructive suggestions, 
constant help and inspiration during the course of this work. He has all 
along been friendly and easily approachable in my hour of need. 
I am thankful to Prof. Shamim Ahmad, Principal, University 
Women's Polytechnic for permitting me to complete this work. 
I am very much obliged to my colleagues in University Women's 
Polytechnic, A.M.U., Aligarh, especially to Mr. Jahangir Alam, Mr. M. 
Ajmal Kafeel and Mr. Misbaur-Rehman Siddiqui, for providing me all 
the support and sharing my responsibilities in the department during the 
last phase of this research. Without their support, it would not have been 
possible for me to complete this work within the time frame. 
I am also grateful to another colleague, Mr. Mohd. Hanzala, for 
his painstaking work during the write-up phase of this thesis. 
I have pleasure in expressing my thanks to my uncle Mr. Irfan Ali, 
IIT, Roorkee, for his kind help in providing me with research material from the 
Library of IIT, Roorkee, whenever I was in need. 
I take this opportunity to thank and acknowledge all the staff 
members of the Department of Computer Engineering for providing me with a 
congenial environment in which to complete this work. 
Last, but not the least, I am very grateful to my mother and 
all the members of my family for the moral support and care that they 
have shown towards me during the period of this work. 
(ABDUS SAMAD) 
Abstract 
The increasing reliance on the Internet as a ubiquitous medium for 
accessing information has placed a heavy load on network resources. As 
a result, there are a number of challenges to the quality of Web services. One of 
the desired qualities is fast access and hence a short download time. There 
are different approaches to improving these services. To meet their 
requirements, efficient servers are designed. Technological developments 
in VLSI and the use of nanotechnology have made computer systems fast 
and efficient. Even so, the computing speed of a single-node server 
cannot satisfy the requirements of many Web applications. Therefore, 
exploiting parallelism is a necessity in the design of high-performance 
computer systems. In terms of hardware, this typically means providing multiple 
simultaneously active processors (nodes). In terms of software, it means 
structuring a program as a set of largely independent subtasks (load). Research 
is active in developing new multiprocessor architectures and in 
scheduling partitioned programs onto them in order to achieve higher 
performance. 
The present work, reported in this thesis, is concerned with the design 
and development of a new multiprocessor architecture, called the Linearly 
Extensible Cube (LEC) network, which is proposed to work as a server. A new 
dynamic scheduling scheme named the Two Round Scheduling (TRS) scheme is 
developed to make maximum use of all the nodes of the proposed server. The 
performance of this server is evaluated for balancing different types of load 
(traffic) among the nodes of the server using the TRS scheme. In addition, 
a second algorithm is proposed and implemented on the LEC server to 
manage a number of retrieval queries. 
The proposed LEC server model is a multiprocessor network which 
exhibits the desirable properties of similar types of multiprocessor networks. 
The LEC network has linear extensibility with only two nodes per extension. 
The network has a lower diameter, which reduces the average path length 
traveled by messages, a constant node degree and a high bisection width. 
A comparative analysis, presented in tabular form, shows the 
superiority of the proposed server in terms of topological properties. 
The TRS dynamic scheduling scheme is proposed to 
schedule different types of load on the proposed LEC network. The TRS 
scheme uses the adjacency matrix to check the level of connectivity. The Ideal 
Load (IL) is calculated by summing the load available on each node in the 
network at a particular load stage and dividing by the total number of nodes in the 
network. Nodes with a load greater than the IL are identified as donor nodes, 
whereas nodes with a load less than the IL are acceptor nodes. Tasks are 
diffused from donors to acceptors considering the connectivity of the network up 
to two levels. The proposed scheduling scheme is also implemented on other 
similar multiprocessor networks, namely the hypercube, de Bruijn and Linearly 
Extensible Tree (LET) networks. The results obtained from the simulation 
studies show that LEC performs better with the proposed TRS scheme. To 
check the effectiveness of the proposed scheduling scheme, other standard 
scheduling schemes, namely the Minimum Distance Scheduling (MDS), 
Hierarchical Balancing Method (HBM) and Gradient Model (GM) schemes, 
are also implemented on the LEC network. The comparative simulation study 
shows that the proposed TRS scheme quickly balances 
the load on the LEC network. Therefore, the proposed model (i.e., the 
LEC network with the TRS scheme) is considered a better organizational model. 
To check the effectiveness of the LEC network when used as a server, it 
is tested both for handling unpredictable communication traffic and for 
servicing a number of queries (requests). The proposed TRS scheme is used to 
utilize all the nodes of the LEC network under uneven communication traffic by 
migrating the load in various block sizes, and the performance 
of the server is evaluated in terms of load balancing time. Similarly, a second 
algorithm is proposed and implemented on the LEC server for information retrieval. 
This scheme organizes the information available on the server (database) in such a 
way that fast retrieval of information can be achieved. A table consisting of 
packet IDs and their addresses is maintained, which is accessible by all the 
nodes of the server. A number of search queries are processed by the 
proposed LEC server and the service/search time is evaluated. The simulation 
results indicate that as the number of queries increases, the search 
time remains approximately constant. The same scheme is also applied to a 
single-node server and a comparative simulation study is made with the 
proposed LEC server. The simulation results show that the 
multiprocessor server reduces the information retrieval time by a factor equal to 
the number of processors in the network. 
In the present work, it is found that the proposed LEC architecture is an 
economical multiprocessor network which exhibits the desired properties of a 
multiprocessor architecture, such as linear extensibility, low diameter, constant 
node degree and high bisection width. Simulation studies show that it 
performs better than other similar multiprocessor networks, 
particularly for uneven traffic. The proposed network, when used as a server, 
performs information retrieval efficiently and may be used in 
Web services. 
CONTENTS 
Certificate 
Acknowledgment 
Abstract 
Contents 
List of Figures 
List of Tables 
List of Abbreviations 
1 Introduction 
1.1 Overview 
1.2 Overview of Multiprocessor Architectures and Scheduling Schemes 
1.3 Motivation 
1.4 Need for Performance Evaluation 
1.5 Original Contribution 
1.6 Thesis Organization 
2 Review of Multiprocessor Architectures 
2.1 Flynn's Taxonomy of Computer Architecture 
2.2 Shared Memory Organization 
2.3 Message Passing Organization 
2.4 Interconnection Networks Taxonomy 
2.4.1 Mode of Operation 
2.4.2 Control Strategy 
2.4.3 Switching Techniques 
2.4.4 Topological Taxonomy 
2.5 Performance Parameters of Interconnection Networks 
2.6 Review of Multiprocessor Interconnection Networks 
2.6.1 Linear Array 
2.6.2 Ring and Chordal Ring 
2.6.3 Tree and Star 
2.6.4 Mesh and Torus 
2.6.5 Systolic Arrays 
2.6.6 Hypercube 
2.6.7 Cube-Connected Cycles 
2.6.8 de Bruijn Network 
2.6.9 Linearly Extensible Tree 
3 Review of Scheduling Schemes 
3.1 Static Scheduling 
3.2 Dynamic Scheduling 
3.3 Types of Dynamic Scheduling Schemes 
3.3.1 Centralized 
3.3.2 Fully distributed 
3.3.3 Partially distributed 
3.3.4 Synchronous versus Asynchronous 
3.4 Dynamic Load Balancing Strategies 
3.4.1 Randomization 
3.4.2 Diffusion 
3.4.3 Dimension Exchange Method 
3.4.4 Gradient Model 
3.4.5 Hierarchical Balancing Method 
3.4.6 Minimum Distance Scheduling 
3.4.6.1 Minimum Distance Property 
4 Linearly Extensible Cube Network 
4.1 Multiprocessor Interconnection Networks 
4.2 Hypercube network 
4.3 Linearly Extensible Cube (LEC) Multiprocessor Network 
4.3.1 Design and Analysis 
4.3.2 Properties of the LEC Network 
5 Performance Measure Strategies 
5.1 Minimum Distance Scheduling Scheme (MDS) 
5.2 Two Round Scheduling (TRS) Scheme 
5.3 Simulation Results 
5.3.1 Dynamic Load Model 
5.3.2 TRS Scheme on LEC network 
5.3.3 TRS Scheme on other networks 
5.4 Performance Study of TRS and other dynamic Scheduling Schemes on LEC network 
5.4.1 Comparison of TRS Scheme with other Scheduling Schemes 
6 LEC as Information Retrieval Server 
6.1 Server Architectures 
6.1.1 Multiprocess Server 
6.1.2 Multi Threaded Server 
6.1.3 Single Process Event Driven Server 
6.1.4 Asymmetric Multi Process Event Driven Server 
6.2 The Proposed LEC Server 
6.3 Information Loading and Retrieval 
6.3.1 Loading of Information on LEC Server 
6.3.2 Retrieval of Information from LEC Server 
6.4 Simulation Results 
7 Conclusion and Future Scope 
7.1 Conclusions 
7.2 Future Work 
References 
List of Publications 
List of Figures 
Figure No. : Caption 
Figure 2.1 : von Neumann architecture 
Figure 2.2 : SIMD architecture 
Figure 2.3 : MIMD architecture 
Figure 2.4 : Shared memory MIMD architecture 
Figure 2.5 : Message passing MIMD architecture 
Figure 2.6 : A topology based interconnection networks taxonomy 
Figure 2.7 : Various types of interconnection topologies 
Figure 2.7(a) : Ring 
Figure 2.7(b) : Binary Tree 
Figure 2.7(c) : Star 
Figure 2.7(d) : Mesh 
Figure 2.7(e) : Torus 
Figure 2.7(f) : Systolic Array 
Figure 2.8 : An 8-processor hypercube network 
Figure 2.9 : An 8-processor un-directed de Bruijn network 
Figure 2.10 : LET network with 6 processors 
Figure 3.1 : Scheduling taxonomy 
Figure 4.1 : Hypercube interconnections 
Figure 4.2 : Arrangement of processors in LEC network 
Figure 4.3(a) : The LEC architecture with six processors 
Figure 4.3(b) : Adjacency matrix for LEC network with six processors 
Figure 4.4 : Comparison of diameter of different multiprocessor networks 
Figure 4.5 : Extensibility of LEC network 
Figure 5.1(a) : The LEC network 
Figure 5.1(b) : Adjacency matrix for LEC 
Figure 5.2 : Performance of LEC network for uniform load 
Figure 5.3 : Performance of LEC network for non-uniform load 
Figure 5.4 : TRS scheme on various multiprocessor networks 
Figure 5.5 : Performance of LEC and other multiprocessor networks 
Figure 5.6 : TRS scheme on various multiprocessor networks 
Figure 5.7 : Performance of LEC and other multiprocessor networks 
Figure 5.8 : The hierarchical structure of LEC network 
Figure 5.9 : Comparison of TRS with other scheduling schemes on LEC 
Figure 5.10 : Comparison of TRS with other scheduling schemes on LEC 
Figure 6.1 : The LEC server 
Figure 6.2 : Performance of LEC IR server 
Figure 6.3 : Performance of LEC IR server 
Figure 6.4 : Performance of LEC for various block sizes 
Figure 6.5 : Information loading and retrieval system 
Figure 6.6 : Comparison between LEC and MP type uni-processor servers 
List of Tables 
Table No. : Caption 
Table No. 1.1 : A summary of the results contained in the thesis 
Table No. 2.1 : Summary of some important interconnection network characteristics 
Table No. 4.1 : Diameter of various sized multiprocessor networks 
Table No. 4.2 : Summary of parameters for various multiprocessor networks 
Table No. 5.1 : The TRS Algorithm 
Table No. 5.2 : Load Migration Table for uniform load on LEC (TRS) up to stage 7 (Sample Output 1) 
Table No. 5.3 : Load Migration Table for non-uniform load on LEC (TRS) up to stage 7 (Sample Output 2) 
Table No. 6.1 : Procedure for Load Balancing (Fixed value of LIF) 
Table No. 6.2 : Procedure to calculate the balancing time for a fixed value of LIF 
Table No. 6.3 : Procedure for searching an item (procedure: 1) 
Table No. 6.4 : Procedure to calculate average searching time (procedure: 2) 
List of Abbreviations 
AMPED : Asymmetric Multi Process Event Driven 
CCC : Cube Connected Cycle 
COMA : Cache Only Memory Architecture 
DEM : Dimension Exchange Method 
DLB : Dynamic Load Balancing 
DLH : Double Loop Hypercube 
DS : Dynamic Scheduling 
FPGA : Field Programmable Gate Array 
GAE : Generalized Adaptive Exchange 
GDE : Generalized Dimension Exchange 
GM : Gradient Model 
HBM : Hierarchical Balancing Method 
HHC : Hierarchical Hypercube 
IL : Ideal Load 
IN : Interconnection Network 
IPC : Inter Process Communication 
IR : Information Retrieval 
LCS : Loosely Coupled Systems 
LEC : Linearly Extensible Cube 
LET : Linearly Extensible Tree 
LIF : Load Imbalance Factor 
MDA : Minimum Distance Accepters 
MDS : Minimum Distance Scheduling 
MIMD : Multiple Instruction Multiple Data Streams 
MISD : Multiple Instruction Single Data Streams 
MP : Multi Process 
MPSoC : Multiprocessor System on Chip 
MRN : Multistage Ring Network 
MT : Multi Threaded 
NUMA : Non-Uniform Memory Access 
OMMH : Optical Multi-Mesh Hypercube 
ORR : Ordered Round Robin 
PD : Parallel Downloading 
PE : Processing Element 
P-ORR : Packetized-Ordered Round Robin 
RID : Receiver Initiated Diffusion 
RIL : Rounded Ideal Load 
SID : Sender Initiated Diffusion 
SIMD : Single Instruction Multiple Data Streams 
SISD : Single Instruction Single Data Streams 
SMLH : Spanning Multi Channel Hypercube 
SoC : System on Chip 
SPED : Single Process Event Driven 
TCP : Transmission Control Protocol 
TCS : Tightly Coupled Systems 
TGS : Task Generated at a particular load Stage 
TRS : Two Round Scheduling 
UMA : Uniform Memory Access 
VLSI : Very Large Scale Integration 
CHAPTER 1 
Introduction 
1.1 Overview 
The Internet provides a convenient and cost-effective communication 
platform for research and development, education and entertainment, and 
almost every other sphere of life. Consequently, the number of Internet users has 
increased rapidly over the last few years. This large user base is placing 
significant stress on the computing resources of popular services available on 
the Internet. Providing fast, effective and reliable information retrieval from the 
Internet is becoming increasingly important. This is particularly true of 
multimedia applications, especially when audio and video data are 
streamed. The success of the Internet arises from its ability to support 
efficient, survivable, robust and reliable end-to-end data transfer services for 
adaptive applications running over a set of end-systems. Popular documents 
maintained at a server can attract a tremendous number of access requests, causing a 
disproportionate increase in client requests. At the same time, users' 
expectations have risen: the desired information should be 
downloaded in the shortest possible time. As a result, the server may not be 
able to handle the load and becomes the bottleneck. Site administrators 
constantly face the need to improve server capabilities. 
To improve download throughput and reliability and to deal with 
the increasing load on the server, there are a number of approaches. The first 
approach is to select the "best" server available [Sayal et al., 1998], [Dykes et 
al., 2000], [Zegura et al., 2000], [Ng et al., 2003], [Ranjan et al., 2004]. These 
approaches rely on estimates of the round trip times between client-server 
pairs and of the server response times. The timing estimates are generally 
updated periodically and requests are forwarded to the "best" server 
selected on the basis of these estimates [Leung and Li, 2006]. Since the 
estimates are refreshed only periodically, server performance can fluctuate 
dramatically owing to significant load imbalance during a download session, 
resulting in deterioration of the quality of service, such as the download time 
and the packet drop rate for 
multimedia streaming [Rodriguez and Biersack, 2002]. Other factors 
may also be considered when selecting the appropriate server for a download, such as 
server-client bandwidth, total elapsed time and server load. Higashi et al. 
proposed a topology-aware server selection method for downloading [Higashi 
et al., 2004]. They showed that the effectiveness of downloading is strongly 
dependent on the physical network topology. 
Instead of finding the "best" server to fulfill a client's request, 
concurrent access to a set of servers for satisfying request also known as 
Parallel Downloading (PD) has been adopted by a number of Internet file 
downloading applications, [Funasaka et al., 2003], [Gkantsidis et al., 2003], 
[Chao and Li, 2004], [Ranjan et al., 2004], [Chen et al., 2005], [Karrer and 
Knightly, 2005], [Chang et al., 2008]. The purpose of PD is to increase the 
download speed compared to a single connection from a single server. In this 
scheme, a client requesting a file will open concurrent connections to multiple 
senders, which can be servers or peers. To decrease the load on the original file 
server, mirror servers are often used. Different parts of the file are then 
transmitted from the different senders to the client. Experiments have 
shown that PD results in high throughput, and therefore a shorter downloading 
time is experienced by the users. The shorter downloading time, however, is 
obtained at the expense of additional overhead in managing the file to be 
downloaded from many servers and in maintaining coordination among 
servers [Philopoulos and Maheswaran, 2001]. Rodriguez et al. proposed a 
dynamic parallel-access scheme to access multiple mirror servers. They 
showed that their dynamic parallel download scheme achieves a significant 
speedup, with a download time close to that of the fastest 
single server, without any server selection [Rodriguez and Biersack, 2002]. 
However, the main disadvantage is that parallel access involves several costs, 
for example the overhead of an extra server access to find out the document 
size and the cost of managing the increasing number of TCP connections. 
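The basic mechanism of parallel downloading described above can be sketched as follows. This is a minimal illustration only, assuming a static, equal-sized split of the file across mirrors; it is not the dynamic scheme of Rodriguez and Biersack. The names split_ranges, fetch and parallel_download are hypothetical, and each mirror is simulated by an in-memory byte string rather than a real network connection.

```python
from concurrent.futures import ThreadPoolExecutor


def split_ranges(file_size, num_servers):
    """Divide [0, file_size) into contiguous byte ranges, one per server."""
    base, extra = divmod(file_size, num_servers)
    ranges = []
    start = 0
    for i in range(num_servers):
        # The first `extra` servers take one additional byte each.
        length = base + (1 if i < extra else 0)
        ranges.append((start, start + length))
        start += length
    return ranges


def fetch(mirror, start, end):
    """Stand-in for a ranged request to one mirror (assumption: no real I/O)."""
    return start, mirror[start:end]


def parallel_download(mirrors, file_size):
    """Request one range from each mirror concurrently and reassemble the file."""
    ranges = split_ranges(file_size, len(mirrors))
    with ThreadPoolExecutor(max_workers=len(mirrors)) as pool:
        futures = [pool.submit(fetch, m, s, e)
                   for m, (s, e) in zip(mirrors, ranges)]
        parts = [f.result() for f in futures]
    # Order the parts by their starting offset before joining them.
    return b"".join(data for _, data in sorted(parts))
```

The document size must be known before the ranges can be computed, which corresponds to the extra server access mentioned above as one of the costs of parallel access.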
The work on PD has focused on downloading time experienced by an 
individual user. It is more commonly adopted for file distribution, especially 
when the size of a file is large [Rodriguez et al., 2000]. There has been no 
performance study on parallel downloading when it is used by a large number 
of users. So, there are two common but conflicting views: the first view is that 
PD is generally effective because it speeds up downloading by exploiting the 
server's capacity in a more balanced fashion. However, the other opinion is that 
parallel downloading is no better, if not worse, than the single-server 
downloading scheme, because the downloading time reduction will become 
less significant if every user chooses to perform parallel downloading [Koo et 
al., 2003]. In such cases, the efficiency will be degraded because everyone will 
be competing for system resources. Also, single download users will suffer, as 
they have to wait longer for their job to finish. Therefore, a careful control 
mechanism is required to improve the efficiency of parallel downloading, when 
multiple users use this scheme [Chang et al., 2008]. 
A more promising solution is to use, in place of a single processor computer, 
a multiprocessor server architecture that manages incoming requests 
transparently by exploiting parallelism among several server nodes 
(processors). The main argument for using multiprocessors is to create 
powerful computers by simply connecting multiple nodes. This gives more 
scope to improve the design of a server in order to achieve the desired 
performance. 
1.2 Overview of Multiprocessor Architectures and Scheduling 
Schemes 
A multiprocessor system is a single computer incorporating a number of 
independent processors that work together to solve a given problem. Advances 
in hardware technology, especially VLSI circuits and nanotechnology, have 
made it possible to build a large-scale multiprocessor system that contains 
thousands of processors [Chang and Chen, 2006]. A suitable interconnection 
network is an integral part of any high performance parallel system. Such 
networks are represented by undirected graphs, where nodes represent the 
processing elements and edges denote the communication links between the 
nodes. One of the important factors for categorizing such systems is the 
interconnection network that ties them together, because the system 
performance is significantly affected by the network topology. The choice of 
the interconnection network may affect several characteristics of the system 
such as node degree, diameter, scalability and cost. Designing such an 
architecture, however, is an important and difficult task. No interconnection 
topology exists which gives optimal performance on all accounts [Stewart and 
Xiang, 2008]. A number of different interconnection networks are used in 
commercially 
available concurrent systems and numerous research prototypes have been 
proposed and reported in the literature [Kim and Veidenbaum, 1999], 
[Parhami, 2000], [Samad and Rafiq, 2005], [Shi and Srimani, 2005], [Kwak 
and Jhon, 2007]. Among them, the binary n-cube, also called the hypercube 
network, is one of the best known and most widely accepted commercially, and 
has proved to be a very powerful topology [Zhang, 2002], [Towles and Dally, 
2004], [Peter et al., 2005], [Youyao et al., 2008]. The hypercube topology has 
two important properties namely small diameter and fault tolerance. The 
diameter of a network is the largest distance between any two nodes. Low 
diameter is better, because it determines the distance involved in the 
communication. The hypercube has a logarithmic diameter. The regular and 
symmetric nature of the network provides fault tolerance. 
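The logarithmic diameter can be checked numerically. The sketch below is an illustration only: the function names are hypothetical, and the plain ring is included purely as a point of comparison chosen here. It builds the binary n-cube as an undirected graph and measures its diameter by breadth-first search.

```python
from collections import deque


def hypercube(n):
    """Adjacency lists of the binary n-cube with nodes 0 .. 2**n - 1."""
    return {v: [v ^ (1 << b) for b in range(n)] for v in range(2 ** n)}


def ring(size):
    """Adjacency lists of a simple ring, used only as a comparison."""
    return {v: [(v - 1) % size, (v + 1) % size] for v in range(size)}


def eccentricity(graph, source):
    """Largest shortest-path distance from `source`, found by BFS."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in graph[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return max(dist.values())


def diameter(graph):
    """Largest distance between any two nodes of a connected graph."""
    return max(eccentricity(graph, v) for v in graph)
```

For 16 nodes, diameter(hypercube(4)) is 4 whereas diameter(ring(16)) is 8, which illustrates why a logarithmic diameter is attractive as the network grows.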
Another important quality of a multiprocessor interconnection network 
is its ability to support future expansion. Scalable networks have the property 
that the size of the system (e.g., the number of nodes) can be increased with 
minor or no change in the existing configuration. The disadvantage of the 
hypercube network from the requirement point of view is that it is not truly 
expandable. Each extension of this network requires doubling the number of 
nodes. Another drawback of the hypercube concerns its VLSI layout 
[Patel et al., 2000]. The minimum number of tracks for the VLSI layout of an 
n-dimensional hypercube using a one-dimensional implementation is of the order 
of the network size, which makes it more difficult to design and fabricate the 
nodes of the hypercube. To remove these limitations, several variations of 
the hypercube such as Folded-crossed hypercube [Zhang, 2002], Dualcube [Yamin 
et al., 2004], Exchanged Hypercube [Peter et al., 2005], Necklace Hypercube 
[Monemizadeh et al., 2005] and Double-Loop Hypercube (DLH) [Youyao et 
al., 2008] have been reported. 
In addition to designing an appropriate network, distributing load (tasks) 
on multiple processors is critical to the performance of a processor network, and 
hence efficient scheduling schemes are implemented to enhance the 
performance of such networks. Inefficient scheduling can lead to a load 
imbalance on various nodes, which can significantly increase the response times 
of the scheduled tasks. There are two approaches to scheduling tasks on 
multiprocessor systems [LeMair and Reeves, 1990], [Ishfaq and Ghafoor, 
1991]. In the first approach, an application comprising a task or set of tasks 
with a priori knowledge about their characteristics is scheduled to the system 
nodes before run time. This type of scheduling problem is better described as 
the assignment or mapping problem and is termed static scheduling. 
The second category of task scheduling considers the current state of the 
system while assigning tasks to the processors. These types of strategies do not 
assume a priori knowledge about the tasks and are known as dynamic 
scheduling strategies. The quality of an algorithm also depends on the 
computing environment. Several static and dynamic scheduling schemes have 
been reported [Watts and Taylor, 1998], [Zeng and Veeravalli, 2004], [Baker, 
2005], [Meraji et al. 2007], [Chandra and Shenoy, 2008], [Dobber et al., 2009], 
[Bertogna et al., 2009]. 
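The difference between the two categories can be illustrated with a small sketch. Both policies below are generic examples chosen for illustration (round-robin mapping and least-loaded assignment); they are assumptions of this sketch and are not the schemes evaluated in the thesis. The function names are hypothetical.

```python
import heapq


def static_round_robin(task_costs, num_nodes):
    """Static mapping: task i is fixed to node i mod p before run time."""
    loads = [0] * num_nodes
    for i, cost in enumerate(task_costs):
        loads[i % num_nodes] += cost
    return loads


def dynamic_least_loaded(task_costs, num_nodes):
    """Dynamic mapping: each arriving task goes to the node whose current
    load is smallest, so the decision depends on the system state."""
    heap = [(0, node) for node in range(num_nodes)]
    heapq.heapify(heap)
    loads = [0] * num_nodes
    for cost in task_costs:
        load, node = heapq.heappop(heap)   # least loaded node right now
        loads[node] = load + cost
        heapq.heappush(heap, (loads[node], node))
    return loads
```

With task costs [8, 1, 1, 1, 8, 1, 1, 1] on two nodes, the static mapping yields loads [18, 4], while the dynamic policy yields [11, 11].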
1.3 Motivation 
The main idea of the research is to design a multiprocessor 
(interconnection) network with fewer nodes and better 
characteristics than the existing networks. Fewer nodes mean a more 
economical design. The other important characteristics of a multiprocessor 
network are diameter, connectivity, extensibility, fault tolerance etc. Scheduling 
schemes will be implemented on it to compare its performance with other similar 
networks. The efficient management of parallelism on an interconnection 
network involves optimizing conflicting performance indices, like the 
minimization of communication and scheduling overheads and uniform 
distribution of load among the processors. The success of a multiprocessor 
system depends upon the effective utilization of nodes with uniform load 
distribution. The load distribution is optimal when all nodes have equal load. 
The present work is motivated by the requirement for the design and 
development of a network model to achieve higher performance and then to 
compare the different network characteristics with other networks. To evaluate 
the network utilization, a number of dynamic scheduling schemes will be 
implemented on the proposed architecture. The performance parameters will be 
calculated and will be compared with other similar multiprocessor networks. 
The proposed architecture will be further tested for its performance in 
the application of information retrieval from the Internet, where it is used as a 
server. 
1.4 Need for Performance Evaluation 
Currently, one of the most important issues in parallel processing is how 
to effectively utilize parallel computers that have become increasingly 
complex. It is estimated that many modern supercomputers and parallel 
processors deliver only 10 percent or less of their peak performance potential in 
a variety of applications. Yet high performance is the very reason why people 
build complex machines. 
The causes of performance degradation are many. Performance losses 
occur because of mismatches among application, software, and hardware. In 
complex systems, mismatches may occur among software modules or hardware 
modules. The communication network bandwidth may not correspond to the 
speed of the processor or that of memory, introducing unwanted latency. 
Mapping applications to parallel computers and balancing parallel 
processors is indeed a very difficult task, and the state of understanding in this 
area is quite inadequate. Moreover, small changes in problem size while using 
different algorithms or different applications may have undesirable effects and 
can lead to performance degradation. 
The various indices that characterize system performance include the Load 
Imbalance Factor (LIF), communication overhead, the complexity of the 
system and algorithm, the efficiency of the system, and speedup, each of which 
measures a different aspect of a computer system's performance. It is 
precisely in this area that the work presented in this thesis is based, and it will 
be shown that a Linearly Extensible Cube (LEC) network with high overall 
performance has been instantiated. 
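To make these indices concrete, the sketch below computes three of them. The formulas are assumptions of this sketch, since this chapter does not define them: LIF is taken as the relative excess of the most heavily loaded node over the ideal (average) load, speedup as serial time divided by parallel time, and efficiency as speedup per node. The function names are hypothetical.

```python
def load_imbalance_factor(loads):
    """Assumed form: (max load - ideal load) / ideal load, where the ideal
    load is the average over all nodes. Zero means perfectly balanced."""
    ideal = sum(loads) / len(loads)
    if ideal == 0:
        return 0.0
    return (max(loads) - ideal) / ideal


def speedup(t_serial, t_parallel):
    """Ratio of single-node execution time to multi-node execution time."""
    return t_serial / t_parallel


def efficiency(t_serial, t_parallel, num_nodes):
    """Speedup normalized by the number of nodes used."""
    return speedup(t_serial, t_parallel) / num_nodes
```

Under this assumed definition, a network whose nodes all carry equal load has an LIF of zero, which matches the statement that load distribution is optimal when all nodes have equal load.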
1.5 Original Contribution 
The complete work, as presented in this thesis, can be divided into three 
parts. The first part is concerned with the design and development of a low cost 
multiprocessor architecture. A new multiprocessor network topology, called 
Linearly Extensible Cube (LEC) multiprocessor network (i.e. LEC network) is 
proposed. It combines the advantages of the hypercube topology, such as 
small diameter, good connectivity, high bisection width, symmetry and 
simple routing, with scalability and a constant node degree independent of the 
network size. 
The thesis also proposes a scheduling scheme that can maintain the 
efficient utilization of all the nodes available in the network (LEC) for various 
types of load. To distribute the load evenly, a novel dynamic scheduling 
scheme named Two Round Scheduling (TRS) has been proposed 
and implemented on the LEC network. The TRS scheme is also implemented on 
other standard reported multiprocessor architectures, and a comparison is 
made, which shows a highly balanced load profile on the LEC network. In 
order to confirm the performance of the proposed TRS scheme, other standard 
dynamic schemes are also implemented on LEC and the simulation results are 
discussed. 
The third and last part of the thesis presents the implementation of 
another proposed dynamic algorithm to manage the retrieval of information on 
the proposed LEC network when used as a server. A number of search queries 
have been examined and the average times taken to process these queries are 
computed. These time estimates are compared with time taken by single node 
server architecture. The simulation results show that the given multiprocessor 
architecture, when implemented as a server with the proposed algorithms, 
reduces the information retrieval time. 
1.6 Thesis Organization 
The thesis work is organized into seven chapters. A chapter-wise 
outline of the thesis is presented below: 
Chapter 2: Review of Multiprocessor Architectures. In this chapter the basic 
concepts and properties of various multiprocessor networks are 
discussed. The chapter then presents a review of various multiprocessor 
architectures reported in the literature. Their characteristics and properties are 
presented in tabular form. The various limitations and shortcomings of the 
available multiprocessor architectures have been mentioned. 
Chapter 3: Review of Scheduling Schemes. In this chapter a review of the 
scheduling schemes, starting from the classification to the present scenario is 
presented. A comparative discussion is made and the various factors 
influencing the run time overheads are discussed. 
Chapter 4: Linearly Extensible Cube. In the light of study and survey 
performed for various multiprocessor architectures, this chapter describes the 
details of the design of the proposed multiprocessor architecture (i.e. LEC) and 
its characteristics. The different properties of the proposed architecture have 
been analysed and compared with those of the Hypercube, de Bruijn and Linearly 
Extensible Tree (LET) architectures to show its characteristics. 
Chapter 5: Performance Measure Strategies. This chapter deals with the 
scheduling of load on the proposed architecture. A new dynamic scheduling 
scheme named Two Round Scheduling (TRS) has been proposed and 
implemented on the LEC network. The TRS scheme has also been implemented on 
other standard reported multiprocessor architectures to compare the 
performance of the LEC network. The comparative performance of the TRS scheme 
has been evaluated by applying other dynamic scheduling schemes on the LEC 
network. 
Chapter 6: LEC as Information Retrieval Server. This chapter is devoted to the 
loading and retrieval of information on the LEC network. The proposed 
network has been tested to work as a server. A second algorithm is proposed 
and implemented for retrieval of information from the LEC server. Through 
simulation study the relative performance of the information retrieval algorithm 
is evaluated. 
Chapter 7: Conclusion and future work. It concludes the overall work and 
emphasizes the positive points of the proposed server and algorithms along 
with the scope for future extension of the work. 
Table 1.1: A summary of the results contained in the thesis 
Performance Evaluation of Linearly Extensible Multiprocessor Architectures for Networking | Our Contribution | Chapter | Publications 
Development of a new interconnection network topology | Design and analysis of the proposed topology (LEC network) and its comparison with other similar existing topologies | Chapter 4 | [Samad and Rafiq, 2005] 
Evaluation of the performance of LEC with standard scheduling schemes | Implementation of various standard scheduling schemes on LEC network | Chapter 5 | [Samad and Rafiq, 2005] 
Development of an interconnection network organization | Development of a suitable scheduling algorithm for LEC network | Chapter 6 | [Samad et al., 2010] 
Evaluating the performance of LEC when used as a server | Development of a new algorithm and its implementation on LEC network for unpredictable requests | Chapter 7 | [Samad et al., 2009] 
Use of LEC server for information retrieval from Internet | Design and implementation of a new algorithm for information retrieval on LEC server | Chapter 7 | [Samad et al., 2008] 
CHAPTER 2 
Review of Multiprocessor Architectures 
Over the past few decades, computer architecture has gone through 
evolutionary rather than revolutionary changes. The conventional von Neumann 
architecture, shown in Figure 2.1, was built as a sequential machine executing 
scalar data. The sequential computer was improved from bit-serial to 
word-parallel operations, and from fixed-point to floating-point operations 
[Hwang, 2001]. However, there are many computation-intensive applications such 
as nuclear weapons simulations, protein folding, global climate modeling, 
multimedia and internet applications. High speed computing is therefore 
becoming essential to modern research and development. Achieving high 
performance depends not 
only on using faster and more reliable hardware components, but also on major 
improvements in computer architecture, processing techniques and 
communication overheads. 
Figure 2.1: von Neumann architecture 
Computer architects have always strived to increase the performance of 
their computer architecture. High performance may come from fast dense 
circuitry, packaging technology, and parallelism. Having massively parallel 
computers is one of the desirable methods to achieve the required level of 
computing [Gomez et al., 2006], [Machida et al., 2008]. Parallel processors are 
computer systems consisting of multiple processing units connected via some 
interconnection network plus the software needed to make the processing units 
work together. The main argument for using multiprocessors is to create 
powerful computers by simply connecting multiple state-of-the-art 
uniprocessors to work in parallel. In areas like molecular biology, 
nano-materials etc., the required computation cannot be met even by the fastest 
computer available [Nehra et al., 2007]. In addition, a multiprocessor 
consisting of a number of single processors is expected to be more cost-
effective than building a high-performance single processor. Another 
advantage of a multiprocessor is fault tolerance. If a processor fails, the 
remaining processors should be able to provide continued service, although 
with degraded performance [Hesham and Mustfa, 2005]. 
There are two major factors used to categorize such systems: the 
processing units themselves, and the interconnection network that ties them 
together. A number of communication procedures exist for multiprocessing 
networks. These can be broadly classified according to the communication 
model as shared memory (single address space) versus message passing 
(multiple address spaces). Communication in shared memory systems is 
performed by 'writing to' and 'reading from' the global memory, while 
communication in message passing systems is accomplished via 'send' and 
'receive' commands. In both cases, the interconnection network plays a major 
role in determining the communication speed. Therefore, the choice to connect 
the processors in a multiprocessor system is a fundamental design decision. 
This may affect several characteristics of the final system such as node 
complexity, scalability, fault tolerance etc. For this reason, a plethora of 
interconnection network proposals have appeared in the literature, and an 
enormous amount of research has centered on the design and analysis of these 
networks [Ravikanth et al., 1990], [Hamid and Hall, 1994], [Kumar and 
Patnaik, 1992], [Abdel and Khaled, 1998], [Yoo et al., 2000], [Shi et al., 2002], 
[Monemizadeh and Sarbazi-Azad, 2005], [Kwak and Jhon, 2007], [Youyao et 
al., 2008]. 
The fundamental architectures for parallel systems are linear arrays and 
rings of processors. A number of applications have been incorporated on such 
machines whose processors are joined in the form of linear arrays and rings 
[Akl, 1997], [Leighton, 1992]. However, in the presence of a faulty node, it is 
very difficult to complete the job on such machines [Stewart and Xiang, 2008]. 
Another drawback of single ring networks is the lack of scalability. A single 
ring does not scale well due to the fixed bandwidth. Hierarchical ring networks 
are the natural extension of single ring networks, but they also suffer from 
limited scalability and cannot be considered a cost-effective topology, 
especially for large scale multiprocessor systems [Kwak and Jhon, 2005]. The 
hierarchical ring networks can be modified as multistage ring networks (MRN) 
[Yoo et al., 2000], and Torus Ring Networks [Kwak and Jhon, 2007]. Yoo et 
al. show that MRN is effective for global traffic on the network without the 
limitation of scalability. Similarly, Kwak and Jhon consider the Torus Ring 
an efficient network when the application program exhibits high memory access 
locality. Chordal rings are another class of ring network, obtained by 
connecting one or two extra links at each vertex in a ring network. A number of 
issues on Chordal rings have been addressed over the past two decades, 
including the diameter problem, the shortest path problem, and the routing and 
fault tolerance problems [Zimmerman and Esfahanian, 1992], [Narayan and 
Opartny, 1999], [Parhami and Kwai, 1999]. The Chordal ring obtained by 
adding two extra links has a high reliability and fault tolerance [Yang et al., 
2007]. Another solution for large scale multiprocessor networks is the class of 
cube-based networks, which offer a rich interconnection structure with a high degree 
of fault tolerance. 
One of the best known and most widely commercially accepted network 
topologies is the binary n-cube, also known as the hypercube network [Patel et 
al., 2000], [Towles and Dally, 2004], [Peter et al., 2005]. The hypercube 
topology has been used in numerous distributed-memory multiprocessors such 
as the Cosmic Cube, Ametek S/14, iPSC, the Ncube etc. [Hwang, 2001]. The 
attractiveness of the hypercube topology is its small diameter and fault 
tolerance. The diameter of a network is the largest distance a message has to 
travel between any two nodes to reach its final destination. Low diameter is 
better, because the diameter indicates the maximum number of hops between 
source and destination nodes. The hypercube has a logarithmic diameter, which 
is identical to the degree of a node, n = log2 N, where N is the number of 
nodes. There are 2^n nodes in the 
hypercube, which are numbered using strings of n bits. Two nodes are 
adjacent if and only if they differ at exactly one bit position. This property 
greatly facilitates the routing of messages through the network. In addition, the 
regular and symmetric nature of the network provides fault tolerance. 
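The way the one-bit adjacency rule facilitates routing can be sketched as follows. This is an illustration of the commonly used dimension-order (bit-fixing) approach, offered as an assumption rather than as the routing scheme of any particular machine named above; the function names are hypothetical.

```python
def neighbours(node, n):
    """Nodes whose n-bit labels differ from `node` in exactly one bit."""
    return [node ^ (1 << bit) for bit in range(n)]


def route(src, dst, n):
    """Dimension-order route: correct the differing bits from the lowest
    to the highest dimension, one hop per differing bit."""
    path = [src]
    current = src
    for bit in range(n):
        if (current ^ dst) & (1 << bit):
            current ^= 1 << bit
            path.append(current)
    return path


def hops(src, dst):
    """Number of hops equals the Hamming distance between the two labels."""
    return bin(src ^ dst).count("1")
```

For example, route(0b000, 0b101, 3) visits nodes 0, 1 and 5. Since two n-bit labels can differ in at most n positions, no route needs more than n hops, which is the logarithmic diameter stated above.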
The other important parameters of an interconnection network are its 
scalability and modularity. Scalable networks have the property that the size of 
the system (e.g., the number of nodes) can be increased with minor or no 
change in the existing configuration. The increase in system size is expected to 
result in an increase in performance to the extent of increase in size. The major 
drawback of the hypercube network is that it is not hardware scalable. As the 
dimension of the hypercube is increased by one, one more link is to be added to 
every node in the network. This limits the use of hypercube network in building 
large size systems out of small size systems with little changes in the 
configuration. To remove this limitation, the cube-connected cycles (CCC) 
network was designed as a substitute for the hypercube [Vuillemin and 
Preparata, 1981]. The node degree of CCC is restricted to three. However, this 
restriction degrades the performance of CCC at the same time. For example, 
CCC has a larger diameter and more complex routing than the hypercube. In 
addition to the changes in the node configuration, extending the regular 
hypercube or CCC requires at least a doubling of the size. 
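The constant node degree of CCC can be verified with a short construction. The sketch below uses the standard definition of cube-connected cycles, in which each hypercube node is replaced by a cycle of n nodes; the labelling of nodes as (x, i) pairs and the function name are choices made here for illustration.

```python
def cube_connected_cycles(n):
    """Build CCC(n): node (x, i) sits at position i of the cycle that
    replaces hypercube node x. Returns a dict of neighbour sets."""
    graph = {}
    for x in range(2 ** n):
        for i in range(n):
            graph[(x, i)] = {
                (x, (i + 1) % n),      # next node on the same cycle
                (x, (i - 1) % n),      # previous node on the same cycle
                (x ^ (1 << i), i),     # hypercube link along dimension i
            }
    return graph
```

For n = 3 this gives 24 nodes, each with exactly three neighbours, whereas a hypercube node in dimension n has n neighbours. The degree stays at three for any n of at least 3, at the price of the larger diameter noted above.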
Another network, a compromise between the hypercube and CCC, is the 
hierarchical hypercube (HHC) [Malluhi and Bayoumi, 1994]. HHC has a 
two-level structure; it takes hypercubes as basic modules and connects them in a 
hypercube manner. Therefore its topology is similar to that of the hypercube and 
it has a logarithmic diameter, which is also the same as the hypercube. 
Nevertheless, HHC still suffers from limited scalability because it does not 
have a constant node degree. Similarly, an Optical Multi-Mesh Hypercube 
(OMMH) is reported [Louri and Sung, 1994], which combines the positive 
features of the hypercube with the mesh topology. The Spanning Multi-channel 
Linked Hypercube (SMLH) is another variation of the hypercube having constant 
degree and a constant diameter [Neocleous et al., 1998]. However, the routing of 
these networks becomes more complex than that of the hypercube. Several other variations of 
the hypercube architecture are reported in the literature, such as folded 
hypercube [Latifi and Amawy, 1991], Gaussian Hypercube [Hsu et al., 1996], 
Metacube [Yamin et al., 2001], Folded-crossed hypercube [Zhang, 2002], 
Dualcube [Yamin et al., 2004], Exchanged Hypercube [Peter et al., 2005], and 
Double-Loop Hypercube [Youyao et al., 2008]. 
Another drawback of the hypercube concerns its VLSI layout. 
The minimum number of tracks for the VLSI layout of an n-cube hypercube using 
a one-dimensional implementation is of the order of the network size, which 
makes it more difficult to design and fabricate the nodes of the hypercube 
[Patel et al., 2000]. A modified hypercube architecture named the necklace 
hypercube, which has good scalability and an efficient VLSI layout, has been 
reported [Monemizadeh and Sarbazi-Azad, 2005]. 
In this chapter, the basic concepts of multiprocessor architectures 
are described first, followed by an overview of the different topologies used 
for interconnecting multiple processors, together with their important 
parameters. 
2.1 Flynn's Taxonomy of Computer Architecture 
The most popular taxonomy of computer architecture was defined by 
Flynn in 1966 [Mano, 2003]. Flynn's classification scheme is based on the 
notion of a stream of information. Two types of information flow into a 
processor: instructions and data. The instruction stream is defined as the 
sequence of instructions performed by the processing unit. The data stream is 
defined as the data traffic exchanged between the memory and the processing 
unit. According to Flynn's classification, either of the instruction or data 
streams can be single or multiple. Computer architecture can be classified into 
the following four distinct categories: 
• Single-instruction single-data streams (SISD) 
• Single-instruction multiple-data streams (SIMD) 
• Multiple-instruction single-data streams (MISD) 
• Multiple-instruction multiple-data streams (MIMD) 
Conventional single-processor von Neumann computers are classified as 
SISD systems. Parallel computers are either SIMD or MIMD. The SIMD 
model of parallel computing consists of two parts: a front-end computer of the 
usual von Neumann style, and a processor array as shown in Figure 2.2. The 
processor array is a set of identical synchronized processing elements capable 
of simultaneously performing the same operation on different data. Each 
processor in the array has a small amount of local memory where the 
distributed data resides while it is being processed in parallel. There is only one 
control unit and all processors execute the same instruction in a synchronized 
fashion. In SIMD architecture, parallelism is exploited by applying 
simultaneous operations across large sets of data. It is especially powerful in 
many numerical calculations such as numerical weather forecasting, finite 
element analysis, earth simulation etc. 
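The SIMD model described above can be mimicked in a few lines. This is a conceptual sketch only: the class and function names are hypothetical, and the loop applies the broadcast instruction to the processing elements one after another, whereas real SIMD hardware would execute it on all elements in lock-step.

```python
import operator


class ProcessingElement:
    """One element of the processor array, holding its own local data."""

    def __init__(self, local_value):
        self.local = local_value

    def execute(self, op, operand):
        # Apply the broadcast instruction to this element's local data.
        self.local = op(self.local, operand)


def simd_broadcast(elements, op, operand):
    """Single control unit: the same instruction is issued to every
    processing element, each working on different data."""
    for pe in elements:  # sequential stand-in for simultaneous execution
        pe.execute(op, operand)


if __name__ == "__main__":
    array = [ProcessingElement(v) for v in [1, 2, 3, 4]]
    simd_broadcast(array, operator.mul, 10)
    print([pe.local for pe in array])
```

A single instruction stream (here, one call per instruction) drives many data streams (the local values), which is the defining property of the SIMD category.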
Figure 2.2: SIMD architecture 
MIMD parallel machines are made of multiple processors and multiple 
memory modules connected together via some interconnection network. In 
these architectures, each processor has its own control unit and can execute 
different instructions on different data as shown in Figure 2.3. 
Figure 2.3: MIMD architecture 
In the MISD category, the same stream of data flows through a linear 
array of processors executing different instruction streams. In practice, there is 
no viable MISD machine; however, some authors have considered pipelined 
machines as examples of MISD. 
MIMD machines are considered the true parallel, or multiprocessor, 
machines. They fall into two broad categories: shared memory or message 
passing. 
2.2 Shared Memory Organization 
Shared Memory Systems form a major category of multiprocessors 
[Kim and Veidenbaum, 1999]. A shared memory system typically 
accomplishes inter-processor coordination through a global memory shared by 
all processors. These systems are also referred to as Tightly Coupled Systems 
(TCS). Communication between tasks running on different processors is 
performed by reading and writing locations in a shared memory that is equally 
accessible by all processors. A shared memory computer system consists of a 
set of independent processors, a set of memory modules, and an interconnection network 
as shown in Figure 2.4. 
Figure 2.4: Shared memory MIMD architecture 
A number of basic issues in the design of shared memory systems have 
to be taken into consideration. These include access control, synchronization, 
protection, and security. Shared memory systems can be designed using bus-
based or switch-based interconnection networks. 
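The synchronization issue mentioned above can be illustrated with threads that communicate only by reading and writing a shared location. This is a sketch under stated assumptions: Python threads stand in for processors, a dictionary entry stands in for a global memory word, and the function name is hypothetical.

```python
import threading


def run_shared_counter(num_workers, increments):
    """Each worker repeatedly updates one shared memory location.
    The lock provides the access control that keeps updates consistent."""
    shared = {"value": 0}          # stand-in for a global memory location
    lock = threading.Lock()

    def worker():
        for _ in range(increments):
            with lock:             # only one processor updates at a time
                shared["value"] += 1

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared["value"]
```

With the lock in place, run_shared_counter(4, 1000) returns 4000. Without such synchronization, concurrent read-modify-write sequences on the same location could interleave and lose updates.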
Depending on the interconnection network, shared memory systems can 
be classified as: Uniform Memory Access (UMA), Non-Uniform Memory 
Access (NUMA), and Cache-Only Memory Architecture (COMA). In the UMA 
system, a shared memory is accessible by all processors through an 
interconnection network in the same way a single processor accesses its 
memory. Therefore, all processors have equal access time to any memory 
location. The interconnection network used in the UMA can be a single bus, 
multiple buses, a crossbar, or a multi port memory. The UMA systems have 
some drawbacks of scalability and memory bottleneck [Kwak and Jhon, 2007]. 
To remove these drawbacks, NUMA systems are designed. In these systems, 
each processor has part of the shared memory attached. The memory has a 
single address space. Therefore, any processor could access any memory 
location directly using its real address. However, the access time to modules 
depends on the distance to the processor. This results in a non-uniform memory 
access time. A number of architectures are used to interconnect processors to 
memory modules in a NUMA [Grindley et al., 2000]. 
Similar to the NUMA, each processor has part of the shared memory in 
the COMA. However, in this case the shared memory consists of cache 
memory. A COMA system requires that data be migrated to the processor 
requesting it. 
2.3 Message Passing Organization 
Message passing systems are a class of multiprocessors in which each 
processor has access to its own local memory. Unlike shared memory systems, 
communications in message passing systems are performed via send and 
receive operations. These systems are also referred to as Loosely Coupled 
Multiprocessor Systems (LCS). A combination of a processor, I/O interface 
and its local memory in such a system is considered as a node, as shown in 
Figure 2.5. 
Figure 2.5: Message passing MIMD architecture 
Nodes are typically able to store messages in buffers (temporary 
memory locations where messages wait until they can be sent or received), and 
perform send/receive operations at the same time as processing. The degree of 
coupling in such a system is very loose, which is why it is also referred to as a 
distributed system. The determinant factor of the degree of coupling is the 
interconnection network. The processing units of a message passing system 
may be connected in a variety of ways ranging from architecture-specific 
interconnection structures to geographically dispersed networks. The message 
passing approach is, in principle, scalable to large proportions. Such systems are usually 
efficient when the interaction between tasks is minimal [Rafiquzzaman and 
Chandra, 1992]. 
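The send/receive style of communication described above can be illustrated with a minimal sketch. It is only an analogy on a single machine: two threads stand in for two nodes, and Python's `queue.Queue` objects stand in for the message buffers. The names `worker` and `run_demo` are hypothetical and do not come from the thesis.

```python
import queue
import threading

def worker(inbox, outbox):
    """A node: receive numbers from its inbox buffer and send back their squares."""
    while True:
        msg = inbox.get()        # receive: blocks until a message is buffered
        if msg is None:          # sentinel value meaning "no more messages"
            break
        outbox.put(msg * msg)    # send the result to the other node

def run_demo(values):
    """Send each value to the worker node and collect the replies in order."""
    to_worker = queue.Queue()    # buffer for messages travelling to the worker
    from_worker = queue.Queue()  # buffer for messages travelling back
    node = threading.Thread(target=worker, args=(to_worker, from_worker))
    node.start()
    for v in values:
        to_worker.put(v)         # send
    to_worker.put(None)          # tell the worker to stop
    node.join()
    return [from_worker.get() for _ in values]
```

The two nodes share no variables; all interaction goes through the buffers, which is the property that distinguishes message passing from shared memory organization.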
2.4 Interconnection Networks Taxonomy 
Multiprocessor interconnection networks (INs) can be classified based
on a number of criteria [Hesham and Mustfa, 2005]. These include:
(1) Mode of operation (synchronous versus asynchronous)
(2) Control strategy (centralized versus decentralized)
(3) Switching techniques (circuit versus packet)
(4) Topology
2.4.1 Mode of Operation 
According to the mode of operation, INs are classified as synchronous 
versus asynchronous. In synchronous mode of operation, a single global clock 
is used by all components in the system such that the whole system is operating 
in a lock-step manner. Asynchronous mode of operation, on the other hand, 
does not require a global clock. Handshaking signals are used instead in order 
to coordinate the operation of asynchronous systems. While synchronous 
systems tend to be slower compared to asynchronous systems, they are race 
and hazard-free. 
2.4.2 Control Strategy 
According to the control strategy, INs can be classified as centralized 
versus decentralized. In centralized control systems, a single central control 
unit is used to oversee and control the operation of the components of the 
system. In decentralized control, the control function is distributed among 
different components in the system. The function and reliability of the central 
control unit can become the bottleneck in a centralized control system. While 
the crossbar is a centralized system, the multistage interconnection networks 
are decentralized. 
2.4.3 Switching Techniques 
Interconnection networks can be classified according to the switching 
mechanism as circuit versus packet switching networks. In the circuit switching 
mechanism, a complete path has to be established prior to the start of 
communication between a source and a destination. The established path will 
remain in existence during the whole communication period. In a packet 
switching mechanism, communication between a source and destination takes 
place via messages that are divided into smaller entities, called packets. On 
their way to the destination, packets can be sent from a node to another in a 
store-and-forward manner until they reach their destination. While packet 
switching tends to use the network resources more efficiently compared to 
circuit switching, it suffers from variable packet delays. 
Wormhole switching is another technique, in which a message is also
broken into smaller parts (called flits), as in packet switching. However, the
difference between wormhole and packet switching is that all flits follow the same
route.
2.4.4 Topological Taxonomy 
An interconnection network topology is a mapping function from the set 
of processors and memories onto the same set of processors and memories. In 
other words, the topology describes how to connect processors and memories 
to other processors and memories. 
A fully connected topology, for example, is a mapping in which each
processor is connected to all other processors in the computer. The number of
hops in a path from source to destination node is equal to the number of
point-to-point links a message must traverse to reach its destination. In a network, a
single message may have to hop through intermediate processors on its way to
its destination. Therefore, the ultimate performance of an interconnection
network is greatly influenced by the number of hops taken to traverse the
network, and topology plays an important role in the design and
implementation of a multiprocessor system.
An interconnection network could be either static or dynamic. In static 
networks, direct fixed links are established among nodes to form a fixed 
network, while in dynamic networks, connections are established as needed. 
Switching elements are used to establish connections among inputs and 
outputs. Depending on the switch settings, different interconnections can be 
established. Nearly all computer systems can be distinguished by their 
interconnection network topology. Figure 2.6 illustrates the classification of
interconnection network topologies. 
[Figure 2.6: Classification of interconnection network topologies]
2.5 Performance Parameters of Interconnection Networks 
This section defines the various methods of connecting processors in a 
parallel computer. A processor organization can be represented by a graph in 
which the nodes (vertices) represent processors and the edges represent 
communication channels between pairs of processors. These processor
organizations can be evaluated based on certain criteria or properties of the
organization [Quinn, 2002], [Hwang, 2001], [Hwang and Briggs, 1985]. These
properties help to understand the effectiveness of a particular organization. The 
various properties are: 
• Number of Nodes (N): The number of nodes in a multiprocessor network
plays a vital role in determining the performance of the system. The higher
the number of nodes, the higher the system performance, but the complexity
also increases. Therefore, the number of processors should be
optimal.
• Diameter (D): The diameter of a network is the measure of the largest 
distance between two nodes. It determines the distance involved in 
communication and hence the performance of multiprocessor systems. 
In simple words, the diameter of a network is the longest of the shortest paths
between any source and destination node. The path length is measured by the
number of links traversed. The network diameter indicates the maximum
number of distinct hops between any two nodes, and thus provides a figure of
communication merit for the network. The diameter should therefore be as small
as possible from a communication point of view.
• Degree (d): The degree in a network is defined as the number of 
connections required at each node. It is the connectivity among different 
nodes in a network. The degrees of nodes determine the complexity of the 
network. Therefore, the node degree should be kept as low as possible in
order to reduce cost. It is best if the number of edges per node is a constant
independent of the network size, because the processor organization then 
scales more easily to systems with large number of nodes. 
• Extensibility: It is the property which facilitates building large systems out of
small ones with minimum changes in the configuration of the nodes. It is
the smallest increment by which the system can be expanded in a useful 
way. In order to avoid the increasing complexity of the system, the 
expansion must be linear. 
• Bisection Width (b): The bisection width of a network is the minimum 
number of edges that must be removed in order to divide the network into 
two halves (within one). High bisection width is better, because in 
algorithms requiring large amounts of data movements, the size of the data 
set divided by the bisection width puts a lower bound on the complexity of 
the parallel algorithm. 
With this background, some important processor organizations are
discussed in the next section.
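The degree and diameter properties defined above can be computed directly from the graph representation of a processor organization. The following is a minimal sketch, assuming an undirected graph stored as an adjacency dictionary; the helper names (`linear_array`, `ring`, `hops_from`, `diameter`, `max_degree`) are illustrative only, and the ring builder assumes at least three nodes.

```python
from collections import deque

def linear_array(n):
    """Adjacency lists of a linear array with n nodes numbered 0..n-1."""
    return {i: [j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)}

def ring(n):
    """Adjacency lists of a bidirectional ring with n >= 3 nodes."""
    return {i: sorted({(i - 1) % n, (i + 1) % n}) for i in range(n)}

def hops_from(adj, src):
    """Breadth-first search: minimum number of links from src to every node."""
    dist = {src: 0}
    pending = deque([src])
    while pending:
        u = pending.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                pending.append(v)
    return dist

def diameter(adj):
    """Largest shortest-path length over all pairs of nodes."""
    return max(max(hops_from(adj, s).values()) for s in adj)

def max_degree(adj):
    """Largest number of links attached to any single node."""
    return max(len(neighbors) for neighbors in adj.values())
```

For eight nodes this gives a diameter of 7 for the linear array and 4 for the bidirectional ring, while the maximum degree is 2 in both cases.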
2.6 Review of Multiprocessor Interconnection Networks 
A number of regular interconnection patterns have evolved over the 
years. This section presents some of the important multiprocessor 
interconnection networks reported in the literature [Quinn, 2002]. These 
patterns include: 
2.6.1 Linear Array 
This is a one dimensional network in which N nodes are connected by 
N-1 links in a line. Internal nodes have degree 2 and the terminal nodes have 
degree 1. The diameter is N-1, which is rather long for large N. The bisection
width is b = 1. Linear arrays are the simplest connection topology. As the
diameter increases linearly with respect to N, a linear array should not be used
for large N. For very small N, it is rather economical to implement a linear array.
2.6.2 Ring and Chordal Ring
A ring is obtained by connecting the two terminal nodes of a linear array 
with one extra link (Figure 2.7 (a)). A ring can be unidirectional or 
bidirectional. It has a constant node degree of d = 2. The diameter is ⌊N/2⌋ for
a bidirectional ring and N for unidirectional ring, where ⌊x⌋ denotes the
greatest integer that does not exceed x. Chordal rings are a variation of ring
networks. By increasing the node degree from 2 to 3 or 4, chordal rings can
be obtained. For degree 3, one extra link and for degree 4, two extra links are
added to produce the two chordal rings, respectively. In general, the more links
are added, the higher the node degree and the shorter the network diameter. It
has been reported that the reliability of the network can be enhanced in a
chordal ring obtained by adding two chords at each vertex of a ring [Yang et al.,
2007].
2.6.3 Tree and Star 
A binary tree of 15 nodes in four levels is shown in Figure 2.7 (b). In
general, an n-level, completely balanced binary tree has N = 2^n - 1
nodes. The degree of a binary tree is 3 and the diameter is 2(n - 1). The binary
tree has low diameter, but has a poor bisection width of 1. With a constant node
degree the binary tree is a scalable architecture.
The Star is a two level tree with an attractive node degree of d = n - 1
(Figure 2.7 (c)) and a small constant diameter of 2. In general the network
diameter of the n-star equals ⌊3(n-1)/2⌋ with a network size n! [Imani and
Azad, 2007], [Day and Tripathi, 1994]. Here n! denotes the factorial of n.
2.6.4 Mesh and Torus 
A 3 x 3 mesh network is shown in Figure 2.7 (d). This is a popular
architecture. In general, a k-dimensional mesh with N = n^k nodes has an
interior node degree of 2k and the network diameter is k(n - 1). In a
two-dimensional mesh, the node degrees at the boundary and corner nodes are 3
and 2, respectively. The bisection width of a k-dimensional mesh with n^k nodes
is n^(k-1).
The torus shown in Figure 2.7 (e) can be viewed as another variant of 
the mesh with an even shorter diameter. This topology combines the ring and 
mesh and extends to higher dimensions. The torus has ring connections across 
each row and along each column of the array. In general an n x n torus has a
node degree of 4 and a diameter of 2⌊n/2⌋.
2.6.5 Systolic Arrays 
This is a class of multidimensional pipelined array architectures. It 
consists of a set of processing elements (PE) connected in mesh-like topology, 
each capable of performing some simple operations. A typical application of a
systolic array is matrix multiplication; a systolic array
specially designed for performing matrix-matrix multiplication is shown in
Figure 2.7 (f).
By replacing a single processing element with an array of PEs, a higher 
computational throughput can be achieved without increasing memory 
bandwidth. Systolic arrays provide faster and scalable architectures. However, 
they are complex, expensive and highly specialized for particular applications 
[Petkov, 1992]. 
Figure 2.7: Various types of interconnection topologies: (a) Ring, (b) Binary Tree, (c) Star, (d) Mesh, (e) Torus
2.6.6 Hypercube 
It is a binary n-cube architecture and is considered one of the most
popular topologies [Towles and Dally, 2004], [Grama et al., 2003], [Peter et al.,
2005]. In general, an n-cube consists of N = 2^n nodes spanning along n
dimensions, with two nodes per dimension. A 3-cube with 8 nodes is shown in
Figure 2.8. A 4-cube can be formed by interconnecting the corresponding
nodes of two 3-cubes. The node degree of an n-cube equals n and so does the
network diameter. The bisection width of an n-cube is 2^(n-1). The
hypercube organization has low diameter and high bisection width at the 
expense of the number of edges per node and the length of the longest edge. 
The length of the longest edge in a hypercube network increases as the number 
of nodes in the network increases. In fact the number of nodes increases 
exponentially with respect to the dimension, making it difficult to consider the 
hypercube a scalable architecture. The other variations of the hypercube 
architecture are Folded hypercube [Latifi and Amawy, 1991], Hierarchical 
Cubic Network [Ghose and Desai, 1995], Metacube [Yamin et al. 2001], 
Folded-crossed hypercube [Zhang, 2002], Dualcube [Yamin et al. 2004], 
Exchanged Hypercube [Peter et al., 2005], Gaussian Hypercube [Hsu et al., 
1996]. However, one of the common drawbacks of these variations is the 
complex routing. Monemizadeh and Sarbazi-Azad proposed a new modified
hypercube architecture named the necklace hypercube. It has good scalability
and an efficient VLSI layout that make the necklace hypercube more attractive
than the hypercube network [Monemizadeh and Sarbazi-Azad, 2005]. Recently 
a new scalable interconnection network topology based on the hypercube 
architecture called Double-Loop Hypercube (DLH) has been reported which 
combines the positive features of the hypercube topology and constant node 
degree of a new double-loop topology [Youyao et al., 2008]. 
Figure 2.8: An 8-processor hypercube network 
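The statement that both the node degree and the diameter of an n-cube equal n follows from the binary labelling of the nodes: two nodes are linked exactly when their labels differ in one bit. The sketch below illustrates this, assuming nodes are labelled with the integers 0 to 2^n - 1. The routing function corrects differing bits from the lowest dimension upward (commonly called dimension-order routing); it is chosen here for illustration and is not taken from the thesis.

```python
def neighbors(node, n):
    """Labels of the n neighbors of a node in an n-cube (flip one bit each)."""
    return [node ^ (1 << d) for d in range(n)]

def route(src, dst, n):
    """Path from src to dst that fixes differing bits from dimension 0 upward."""
    path = [src]
    current = src
    for d in range(n):
        if (current ^ dst) & (1 << d):   # labels differ in dimension d
            current ^= 1 << d            # traverse the link along dimension d
            path.append(current)
    return path

def hop_count(src, dst):
    """Number of links on a shortest path: the count of differing bits."""
    return bin(src ^ dst).count("1")
```

In the 3-cube of Figure 2.8, node 0 has neighbors 1, 2 and 4, and a message from node 0 to node 7 crosses three links, which is the diameter of the network.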
2.6.7 Cube-Connected Cycles 
This architecture is a modification of the hypercube; e.g. a 3-cube is modified to
form a 3-cube-connected cycles (CCC) network, restricting the node degree to 3
[Vuillemin and Preparata, 1981]. The idea is to replace the corner nodes
(vertices) of the 3-cube with a ring of 3 nodes. In general, one can construct
k-cube-connected cycles from a k-cube with n = 2^k rings: each vertex of the
k-dimensional hypercube is replaced by a ring of k nodes. A k-cube can thus
be transformed to a k-CCC with k x 2^k nodes. The major
improvement of a CCC lies in its constant node degree of 3, which is
independent of the dimension of the underlying hypercube. On the other hand,
it has a larger diameter, which ultimately results in more complex routing than
in the hypercube [Youyao et al., 2008].
2.6.8 de Bruijn Network
The de Bruijn interconnection network is a versatile topology having a 
fixed degree per node [Ravikanth et al., 1988], [Samathan and Pradhan, 1989]. 
It consists of N = 2^n nodes. The number of edges per node is a constant equal to 4,
independent of the network size. The bisection width of a de Bruijn network
with 2^n nodes is 2^n/n, and the length of the longest edge increases with the size
of the network. The de Bruijn network contains shuffle connections and the
diameter with 2^n nodes is n.
Figure 2.9: An 8-processor un-directed de Bruijn network 
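The shuffle connections of the de Bruijn network can be written down explicitly. The sketch below assumes the usual binary construction, in which node x (an n-bit label) is linked to the two labels obtained by shifting x left and appending 0 or 1, and to the two labels from which x is obtained in the same way. Self-loops are dropped and the links are treated as undirected; the function names are illustrative.

```python
from collections import deque

def de_bruijn_neighbors(x, n):
    """Neighbors of node x in the undirected binary de Bruijn network with 2**n nodes."""
    size = 1 << n
    successors = {(2 * x) % size, (2 * x + 1) % size}   # shift left, append 0 or 1
    predecessors = {x >> 1, (x >> 1) + size // 2}        # shift right, prepend 0 or 1
    return sorted((successors | predecessors) - {x})     # drop self-loops

def diameter(n):
    """Largest shortest-path length, found by breadth-first search from every node."""
    size = 1 << n
    worst = 0
    for src in range(size):
        dist = {src: 0}
        pending = deque([src])
        while pending:
            u = pending.popleft()
            for v in de_bruijn_neighbors(u, n):
                if v not in dist:
                    dist[v] = dist[u] + 1
                    pending.append(v)
        worst = max(worst, max(dist.values()))
    return worst
```

For the 8-processor network of Figure 2.9 (n = 3), no node has more than 4 links and the computed diameter is 3, in agreement with the degree and diameter quoted above.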
2.6.9 Linearly Extensible Tree (LET) 
A binary type network topology, shown in Figure 2.10, has been reported
[Rafiq et al., 1999]. The LET architecture exhibits better connectivity, a smaller
number of nodes and linear extensibility compared with Hypercube and de Bruijn
networks. The network has low diameter, hence reduces the average path
length traveled by all messages, and has a constant degree per node. The
LET network grows linearly in a binary tree like shape. In a binary tree the
number of nodes at level j is 2^j whereas in the LET network the number is (j+1).
Figure 2.10: LET network with 6 processors 
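The difference in growth between the two structures can be made concrete with a short calculation. The sketch assumes that levels are numbered from 0 at the root, which is consistent with a binary tree having 2^j nodes at level j; the function names are illustrative.

```python
def binary_tree_level_size(j):
    """Number of nodes at level j of a complete binary tree (root at level 0)."""
    return 2 ** j

def let_level_size(j):
    """Number of nodes at level j of a LET network (root at level 0)."""
    return j + 1

def total_nodes(level_size, levels):
    """Total node count over the first `levels` levels."""
    return sum(level_size(j) for j in range(levels))
```

With three levels the LET network has 1 + 2 + 3 = 6 nodes, matching the network of Figure 2.10, whereas a complete binary tree with three levels has 7 nodes. Adding one more level adds 4 nodes to the LET network but 8 nodes to the binary tree.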
Table 2.1 summarizes the above characteristics for various 
multiprocessor interconnection networks. 
Table 2.1: Summary of some important interconnection network characteristics
Type                    Size (N) (Nodes)      Degree (d)  Diameter (D)  Bisection Width (b)  Extensibility
Linear Array            N                     2           N - 1         1                    Linear (N+1)
Ring                    N                     2           ⌊N/2⌋         2                    Linear (N+1)
Binary Tree             N = 2^n - 1           3           2(n - 1)      1                    Linear (N+2)
n-Star (k-dimensional)  N = n!                n - 1       ⌊3(n-1)/2⌋    N - 1                Factorial (n!)
Mesh (k-dimensional)    N = n^k               2k          k(n - 1)      n^(k-1)              Exponential (n^k)
Torus (2D)              N = n x n             4           2⌊n/2⌋        2n                   Exponential (n^2)
Hypercube               N = 2^n               n           n             2^(n-1)              Exponential (2^n)
CCC                     N = n2^n              3           2n            2^(n-1)/n            Exponential (n2^n)
Necklace-Hypercube      N = 2^n + nk2^(n-1)   2n          n + k         2^n                  Exponential (2^n + nk2^(n-1))
de Bruijn               N = 2^n               4           n             2^n/n                Exponential (2^n)
LET                     N = Σk                4           √N            2log2(n-2)           Linear (n+1)
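Some of the closed-form entries of Table 2.1 can be evaluated for concrete sizes as a quick sanity check. The sketch below covers only the rows whose formulas are stated unambiguously in the text of Section 2.6; the function names and the dictionary layout are illustrative choices.

```python
def linear_array(N):
    """Linear array with N nodes (interior node degree is 2)."""
    return {"nodes": N, "degree": 2, "diameter": N - 1, "bisection": 1}

def ring(N):
    """Bidirectional ring with N nodes."""
    return {"nodes": N, "degree": 2, "diameter": N // 2, "bisection": 2}

def binary_tree(n):
    """Completely balanced binary tree with n levels."""
    return {"nodes": 2 ** n - 1, "degree": 3, "diameter": 2 * (n - 1), "bisection": 1}

def mesh(n, k):
    """k-dimensional mesh with n nodes along each dimension."""
    return {"nodes": n ** k, "degree": 2 * k, "diameter": k * (n - 1),
            "bisection": n ** (k - 1)}

def hypercube(n):
    """Binary n-cube."""
    return {"nodes": 2 ** n, "degree": n, "diameter": n, "bisection": 2 ** (n - 1)}
```

For example, `binary_tree(4)` reports 15 nodes, the size of the tree in Figure 2.7 (b), and `hypercube(3)` reports 8 nodes with degree 3, diameter 3 and bisection width 4.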
From the above discussion and the comparative study made in Table 2.1, 
it may be concluded that: 
• The node degrees of most networks are 4 or less, which is
desirable. However, the star network has a very high node degree.
• The hypercube node degree increases with log2 N and is also not
appealing when N becomes large. Similarly, in the Necklace-Hypercube,
there are 2^n nodes of degree 2n, which is not desirable.
• Network diameters vary over a wide range. Although the
hypercube and de Bruijn have small diameters, both are lacking in
terms of scalability.
• The Necklace-Hypercube and Double-loop hypercube have good
scalability, but still have exponential extensibility, which is
undesirable.
The effort to exploit the potential power of parallel systems
is not limited to designing efficient architectures; the second
important issue is the balancing of the computational load over these networks.
Load balancing is a must to maximize processor utilization, avoid any waste of
resources, and increase the overall performance of the system. Therefore, a
number of scheduling schemes to balance the load are discussed in
Chapter 3.
CHAPTER 3 
Review of Scheduling Schemes
In addition to designing an appropriate multiprocessor network, the 
efficient management of parallelism on an interconnection network involves 
optimizing conflicting performance indices, like the minimization of 
communication and scheduling overheads and uniform distribution of load 
among the processors (nodes). In such a system more than one node processes
the various jobs concurrently. Each job may consist of various tasks that could
be executed independently. The number of tasks allocated to each processor
has to be controlled in such a way that a high speed execution of processes may
occur while maintaining high processor utilization. Due to the uneven
division of tasks (load), some processors may complete execution of their tasks
before others and become idle. In a multiprocessor system, if some nodes
remain idle while others are extremely busy, system performance will be
degraded drastically. Therefore, scheduling of tasks becomes an important
problem for multiprocessor system architectures and consequently it has a
substantial effect on the system performance and utilization. It is required that
all the processors share the load evenly, which leads to completion of the
job in the minimum possible time.
The scheduling problem is to maintain a balanced execution of all the 
tasks among the various available nodes in a multiprocessor network. A 
collection of independent tasks originates at, and is mapped onto, the root processor. A
scheduling policy assumes a set of processors and a set of tasks which are to be
serviced by these processors according to a specific policy. This chapter studies 
the different methods and scheduling algorithms that have been proposed for 
scheduling the load on specific interconnection network topologies. 
Scheduling may be performed at the local level or global level based on
the information used to make load balancing decisions [Zaki et al., 1997].
In a uni-processor system, scheduling is performed by the operating system on
the basis of the time-slices of the processor; this is known as local scheduling. Global
scheduling, however, decides the processor in a multiprocessor system on
which a process is to be executed [Sharma et al., 2008]. In the global schemes,
the scheduling decision is made using global knowledge: i.e. all the processors 
take part in the synchronization and send their performance profiles to the 
scheduler. Scheduling algorithms can be classified as either static or dynamic. 
The static algorithm performs by a predetermined policy, whereas, the dynamic 
algorithm makes its decision at run time according to the status of the system 
[Zeng and Veeravalli, 2006]. 
3.1 Static Scheduling 
In static scheduling, processes are assigned to processors before
execution begins. Information regarding the total mix of processes in the system as
well as all the independent subtasks involved in a job or task is assumed to be 
available by the time the program object modules are linked into load modules. 
Hence, each executable image in a system has a static assignment to a 
particular processor, and each time that process image is submitted for 
execution, it is assigned to that processor. In other words, static scheduling 
requires partitioning the job into a set of independent tasks and then statically 
allocating them to processors so as to have maximum balance. Static 
scheduling has been well studied [Grosu and Chronopouls, 2002], [Zeng and 
Veeravalli, 2004], [Li and Kamede, 1998]. Houle et al. discussed the problems 
for static load balancing on trees, assuming that the total load is fixed [Houle et
al., 2002]. The goal of a static load balancing method is to reduce the overall
execution time of a concurrent program while minimizing the communication 
delays. Some examples of static algorithms presented by [Sharma et al, 2008] 
are: 
• Round Robin and Randomized Algorithms 
• Central Manager Algorithm 
• Threshold Algorithm 
A modified Round Robin scheduling for divisible load named Ordered 
Round-Robin (ORR) scheduling was proposed by [Yao et al., 2008]. The 
theoretical derivation and analysis are discussed. They have further proposed 
and designed the packetized version of the ORR algorithm named Packetized-
ORR (P-ORR) to deal with variable length packets. 
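The plain Round Robin policy listed among the static algorithms above can be sketched in a few lines. This is the basic cyclic assignment only, not the ORR or P-ORR variants of [Yao et al., 2008]; the function name and the list-of-lists result format are illustrative assumptions.

```python
def round_robin(tasks, num_processors):
    """Assign task i to processor i mod num_processors, before execution starts."""
    assignment = [[] for _ in range(num_processors)]
    for i, task in enumerate(tasks):
        assignment[i % num_processors].append(task)
    return assignment
```

The assignment depends only on the order of the tasks and the number of processors, not on the run-time load, which is what makes the policy static.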
The static scheduling is generally performed in multi-computer systems 
where the load is distributed across different computers before execution using 
a priori known information and the load distribution remains unchanged at 
runtime [Nehra et al., 2007]. A general drawback that exists in all the static
schemes is that the final selection of a host for process allocation is made when
the process is created and cannot be changed during process execution to respond to
changes in the system load [Sharma et al., 2008]. The main objective of load
balancing methods is to speed up the execution of applications on resources
whose workload varies at run time in an unpredictable way. Although static
scheduling avoids the run-time scheduling overhead, a
multiprocessor environment with load changes on the nodes requires a more dynamic
approach. However, the distinction between static and dynamic job
allocation algorithms is not very clear and different authors use slightly different
definitions of static and dynamic algorithms. Recently, a hybrid scheduling
approach has received some attention [Boeres et al., 2003], [Subrata et al.,
2008]. A hybrid load balancer attempts to combine the qualities of static and
dynamic job allocation algorithms, while minimizing their relative inherent
disadvantages. Static scheduling can be further divided into the following
categories: 
• Optimal versus Sub-optimal: In static scheduling, it is assumed that all 
information governing the scheduling decisions that can include the 
characteristics of the jobs, the computing nodes, and the communication 
network are known in advance. An optimal scheduling decision is made 
deterministically based on some criterion function. On the other hand, if 
these problems are computationally not feasible, a sub-optimal or 
probabilistic decision may be applied [Darbha and Agrawal, 1998], 
[Park and Choe, 2002]. There are two ways to obtain the sub-optimal 
solutions namely approximate and heuristic. 
• Approximate versus Heuristic: In approximate scheduling, the same formal
computational model is used but, instead of searching the entire solution
space for an optimal solution, one is satisfied when a good one is found.
The primary intention of a heuristic is to find a solution as fast as possible,
if necessary, at the cost of quality. Heuristics are characterized by their
essentially deterministic operation [Lee and Zomaya, 2008]. There are
various methods of task allocation on the processors, where the
scheduling decisions may be optimal or sub-optimal, as shown in Figure
3.1.
3.2 Dynamic Scheduling 
In Dynamic Scheduling (DS), the load is distributed among the 
processors during execution time in such a way that each processor would have 
the same or nearly the same amount of work to do. This 
redistribution/allocation of load is performed by transferring the tasks from the 
over-loaded processors to the under-loaded processors with the aim of 
obtaining the highest possible execution speed. DS schemes are widely
recognized as important techniques for the efficient utilization of resources in a
multiprocessor system. The performance of such a system may be increased by 
increasing the utilization of CPU, memory or a combination of CPU and 
memory [Qin et al., 2003]. There has been a lot of research for dynamic load 
balancing (DLB) in traditional parallel and distributed systems literature for 
more than two decades [Ishfaq and Ghafoor, 1991], [LeMair and Reeves, 
1993], [Zaki et al., 1997], [Watts and Taylor, 1998], [Anand et al., 1999], 
[Ciardo et al., 2001], [Dobber et al., 2005], [Zeng and Veeravalli, 2006], 
[Dobber et al., 2009]. Yagoubi and Slimani addressed the problem of load
balancing in grid computing for grids having a tree type architecture
[Yagoubi and Slimani, 2006]. The important issues in DLB are:
• When to invoke a balancing operation. 
• Who makes load balancing decision according to what information, and 
• How to manage load migration between processors. 
Besides, there are two important considerations when dynamic scheduling
algorithms are implemented on parallel systems. The first is that parallel systems
generally use a regular point-to-point interconnection network, instead of a
random network configuration, and that load imbalance occurs mainly
because of the uneven and unpredictable nature of tasks. Dynamic approaches
have a major drawback: they are very sensitive to inaccuracies in the
performance prediction information that the algorithm uses for job allocation
purposes. Due to this high sensitivity, they produce extremely poor results even
when the information accuracy is only slightly less than 100 percent. Secondly,
in parallel systems, it is very hard to achieve and maintain 100 percent
accurate information [Subrata et al., 2008].
3.3 Types of Dynamic Scheduling Schemes 
In a multiprocessor system, the effect of a scheduling operation is 
observed on all the co-operating processors. There are different models of DS 
which decide the effect and co-ordination across the processors [Banicescu
and Velusamy, 2002], [Dobber et al., 2004], [Attiya, 2004], [Corbalan et al.,
2005], [Beaumont et al., 2008], [Chandra and Shenoy, 2008]. The various
models of DS are shown in Figure 3.1. The different models are: 
3.3.1 Centralized 
Centralized load balancing policies are characterized by the use of a 
dedicated processor which also takes part in computation. This processor,
also known as the master processor or central scheduler, is responsible for making
all load balancing decisions [Lin and Raghavendra, 1992]. The master
processor gathers the global information about the state of the system and
assigns tasks to individual nodes. In this way it can improve the resource
utilization by applying sophisticated algorithms. However, for large systems
consisting of hundreds or thousands of nodes, the master processor becomes a bottleneck.
Moreover, if the central processor fails, the whole system stops working.
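The role of the master processor can be illustrated with a small sketch. The text does not specify which algorithm the master applies, so the policy used here (each task goes to the node with the smallest accumulated load) is an assumption made purely for illustration, as are the function name and the representation of tasks by their costs.

```python
import heapq

def central_assign(task_costs, num_nodes):
    """Master-side assignment: give each task to the currently least-loaded node.

    Returns (assignment, loads), where assignment[i] is the node chosen for
    task i and loads[node] is the total cost assigned to that node.
    """
    heap = [(0, node) for node in range(num_nodes)]   # (load, node) pairs
    heapq.heapify(heap)
    assignment = []
    loads = [0] * num_nodes
    for cost in task_costs:
        load, node = heapq.heappop(heap)              # node with the smallest load
        assignment.append(node)
        loads[node] = load + cost
        heapq.heappush(heap, (loads[node], node))
    return assignment, loads
```

Every decision passes through this single function, which mirrors both the strength of the centralized model (global knowledge of all loads) and its weakness (one point of contention and failure).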
3.3.2 Fully distributed 
It is an alternative to the centralized approach; here the load balancing
decisions are carried out by all the processors of the system. Each node
executes a scheduling algorithm by exchanging information with other nodes.
It is therefore very costly for each node to obtain and maintain the dynamic
state information of the whole system [Shivaratri et al., 1992].
Figure 3.1: Scheduling taxonomy (legible labels: Scheduling Schemes; Static; Optimal, Suboptimal; Approximate, Heuristic; Enumerative, Graph Theory, Mathematical Programming, Queuing Theory)
Many variants of fully distributed schemes appear in the literature.
However, there are a number of drawbacks associated with them [Ishfaq and
Ghafoor, 1991]. The first is that for large systems (more than 100 processors),
optimal scheduling decisions are difficult to make, and even when correct decisions
are made, they result in a high control overhead at heavy load conditions. The
second drawback is that fully distributed algorithms use partial information
about the state of the system and therefore make suboptimal decisions [Subrata et al, 2008]. A
reduced amount of information results in a smaller range of scheduling options.
Another problem with fully distributed schemes is communication delay,
which may turn a correct scheduling decision into a wrong choice. Therefore,
it may be concluded that a fully distributed algorithm is a better option for small
to moderate systems.
3.3.3 Partially distributed 
Partially distributed algorithms, sometimes called semi-distributed algorithms,
are proposed as a trade-off between centralized and fully distributed scheduling
schemes [Ishfaq and Ghafoor, 1991]. The main idea is that the system is
divided into different regions and thus the load balancing problem is divided 
into subtasks. Each region is generally managed by a single master processor 
using a centralized scheme. Master processors of each region may exchange 
information for balancing the load dynamically in the system. 
3.3.4 Synchronous versus Asynchronous 
The fully distributed and partially distributed schemes may further be
categorized as synchronous and asynchronous based on the instant at which
load balancing operations are made. In synchronous schemes all processors
involved in load balancing carry out balancing operations at the same instant. Each
processor cannot proceed with normal computation until the load migrations
demanded by the current operations have been completed. On the other hand,
in the asynchronous approach the running processor takes the load balancing
decision independently. Each processor performs the balancing action
regardless of what the other processors are doing at that time. A number of
synchronous and asynchronous load balancing algorithms have been discussed
in the literature [Bahi et al., 2005].
3.4 Dynamic Load Balancing Strategies 
There are a number of approaches to solve non-uniform problems on
multiprocessor systems based on the various models discussed above. Some of
the most relevant strategies reported in the literature are discussed below. These
include:
3.4.1 Randomization 
In this scheme, the destination processors for load transfer are chosen in 
a random fashion. These algorithms use local information to make movement 
decisions. A threshold value, sometimes called the ideal load (IL), is calculated
which decides whether a processor is overloaded or underloaded. When a
processor detects that there is load imbalance, a processor is randomly selected
as a destination of load movement. In some algorithms, instead of using only
one threshold value, two threshold values (L1 and L2) are used to decide
about the overloaded and underloaded processors. A processor is considered
overloaded when its load becomes greater than L1. Similarly, underloaded
processors are those whose load is smaller than L2.
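The two-threshold classification and the random choice of destination can be sketched as follows. The amount of load moved is not specified in the description above; the sketch assumes that an overloaded processor sends exactly its surplus above L1, and it passes the random number generator in explicitly so that runs are repeatable. All names are illustrative.

```python
import random

def classify(load, l1, l2):
    """Label a processor using the two thresholds L1 (upper) and L2 (lower)."""
    if load > l1:
        return "overloaded"
    if load < l2:
        return "underloaded"
    return "normal"

def random_transfer(loads, l1, rng):
    """One balancing pass: each overloaded processor sends its surplus above L1
    (an assumed amount) to a destination chosen at random among the others."""
    new_loads = list(loads)
    for src, load in enumerate(loads):
        if load > l1 and len(loads) > 1:
            dst = rng.choice([p for p in range(len(loads)) if p != src])
            surplus = load - l1
            new_loads[src] -= surplus
            new_loads[dst] += surplus
    return new_loads
```

A call such as `random_transfer([10, 1, 1], 4, random.Random(0))` leaves processor 0 with a load of 4 and moves the surplus of 6 to one of the other two processors. Because the destination is chosen without consulting its load, the total load is conserved but a single pass need not produce a balanced state.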
3.4.2 Diffusion 
In this scheme the destination node is selected from a pool of neighbor 
nodes. A neighbor node is one which has a direct link to the source node. One 
simple method for dynamic load balancing is to select the neighbor node for
load transfer, if it is underloaded. In this way, a local load balance is achieved
by migrating the surplus load. The movement of surplus load can be interpreted as diffusion
through the processors towards a balanced state. Diffusion algorithms assume
that a processor is able to send and receive load to/from all its neighbor nodes
simultaneously. If there is no underloaded processor amongst the neighbor
nodes, then the nodes on the next levels are selected. In this way the method
solves the diffusion problem iteratively. LeMair and Reeves classified the
diffusion algorithms into two groups [LeMair and Reeves, 1993]: 
i) Sender Initiated Diffusion (SID): This approach uses near-neighbor
load information to share out surplus load from heavily loaded
processors to underloaded neighbor processors in the system. In other words,
tasks from heavily loaded processors diffuse into lightly loaded areas of the
system. Each processor acts independently and is limited to load information
from within its domain, which consists of itself and its immediate neighbors.
The underlying processor checks the load of its neighbor processors. Any
neighbor processor whose load value is smaller than the underlying
processor's load is considered underloaded. Once
the underloaded processors are identified, the underlying processor evaluates
the load difference between itself and each of its underloaded neighbors.
Subsequently, a fixed portion of the corresponding load difference is sent to
each of the underloaded neighbors. The difference is calculated based on
the average load of the underlying processor, such as the ideal load (IL). All
processors inform their near neighbors of their load levels and update this
information throughout program execution.
ii) Receiver Initiated Diffusion (RID): In RID, the underloaded processors 
are the active processors. These processors request load from the overloaded 
neighbor nodes in the system. The balancing process is initiated by any 
processor whose load becomes smaller than the prescribed threshold. However,
upon receiving a load request, a processor will fulfill the request only up to an
amount equal to half of its current load. The majority of the overhead in the RID
scheme falls on the underloaded processors.
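The difference between the two diffusion variates can be sketched as follows. This is a simplified illustration, not the algorithm of [LeMair and Reeves, 1993]: the function names, the `fraction` parameter, the choice of the most loaded neighbor as donor, and the size of the request (the shortfall below the threshold) are all assumptions. Only the "fixed portion of the load difference" rule for SID and the "at most half of the current load" rule for RID come from the text.

```python
def sid_step(loads, neighbors, fraction):
    """Sender initiated: every node pushes `fraction` of the load difference
    to each neighbor that is less loaded than itself (synchronous snapshot)."""
    delta = [0.0] * len(loads)
    for p, nbrs in neighbors.items():
        for q in nbrs:
            if loads[q] < loads[p]:
                amount = fraction * (loads[p] - loads[q])
                delta[p] -= amount
                delta[q] += amount
    return [load + d for load, d in zip(loads, delta)]

def rid_step(loads, neighbors, threshold):
    """Receiver initiated: every node below `threshold` requests its shortfall
    from its most loaded neighbor, which grants at most half its current load."""
    new_loads = list(loads)
    for p, nbrs in neighbors.items():
        if new_loads[p] >= threshold:
            continue
        donor = max(nbrs, key=lambda q: new_loads[q])
        if new_loads[donor] <= new_loads[p]:
            continue  # no neighbor is more loaded, nothing to request
        grant = min(threshold - new_loads[p], new_loads[donor] / 2)
        new_loads[donor] -= grant
        new_loads[p] += grant
    return new_loads

# Four processors connected as a ring, all load initially on processor 0.
ring = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
after_sid = sid_step([12, 0, 0, 0], ring, fraction=0.25)
after_rid = rid_step([12, 0, 0, 0], ring, threshold=3)
```

In both cases the total load is conserved; the sketch only changes who initiates the transfer. For SID the fraction should be small enough (for example at most 1/(d+1) for node degree d) that a sender never gives away more than it holds.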
3.4.3 Dimension Exchange Method (DEM) 
It is a global, fully synchronous approach for load balancing. Load
balancing is achieved in an iterative fashion by "folding" an N-processor
system into log2 N dimensions and balancing one dimension at a time. This
method was initially studied for hypercube topologies, where processor
neighbors are inspected by following each dimension of the hypercube
[Cybenko, 1989]. The processors of a k-dimensional hypercube pair up with
their neighbors in each dimension and exchange half the difference in their
respective loads. Later research adapted this into a more
efficient DE-type algorithm named the Generalized Dimension Exchange (GDE)
strategy [Xu and Lau, 1992], [Xu and Lau, 1995]. Similarly, Dimension
Exchange on hypercube architectures with broken edges has been studied [Bahi
et al., 2003]. This enhanced version of GDE is termed Generalized
Adaptive Exchange (GAE). A number of further policies that
work on the principles of GAE are also discussed [Bahi et al., 2005].
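A minimal sketch of one dimension-exchange sweep on a hypercube is given below. It assumes arbitrarily divisible load and N a power of two; the function name is illustrative. Exchanging half the load difference leaves both partners with the average of their two loads.

```python
def dimension_exchange(loads):
    """One sweep over all log2(N) dimensions of a hypercube of N = 2**n nodes."""
    n_nodes = len(loads)
    dims = n_nodes.bit_length() - 1
    if n_nodes < 1 or (1 << dims) != n_nodes:
        raise ValueError("number of processors must be a power of two")
    loads = [float(x) for x in loads]
    for d in range(dims):
        for p in range(n_nodes):
            q = p ^ (1 << d)      # partner of p across dimension d
            if p < q:             # handle each pair once
                avg = (loads[p] + loads[q]) / 2
                loads[p] = loads[q] = avg
    return loads

balanced = dimension_exchange([8, 0, 0, 0, 0, 0, 0, 0])
```

With divisible load, a single sweep leaves every processor holding the global average, which is why the method needs only log2 N exchange steps on a hypercube.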
3.4.4 Gradient Model (GM) 
In this scheme [LeMair and Reeves, 1993], load migration is restricted to
the direction of the most lightly loaded processors in the system. The
basic procedure is that underloaded processors inform other processors in the
system of their state and, as a result, the overloaded processors respond. That is, an
overloaded processor sends its excess load to only one lightly loaded
neighbor processor at the end of one iteration of the load balancing algorithm.
The scheme is based on two threshold parameters, L1 and L2. A processor
is considered overloaded when its load becomes greater than L1, light if it is below
L2, and moderate otherwise. The Gradient Model scheme differs from the
Dimension Exchange scheme in that, in GM, the load information of
the entire underlying domain is considered in deciding the destination
processor, whereas in DEM only one processor is considered at each iteration.
In the Gradient Model algorithm the first step is to determine the loading
condition of each individual processor: light, moderate or heavy. The second
step consists of establishing a system-wide gradient map to generate routes
between underloaded and overloaded processors. The gradient map is
represented by the aggregate of all proximities. A node's proximity is defined
as the minimum distance from itself to the nearest lightly loaded node in the
system. Initially, all the nodes have a proximity of w_max, a constant equal to the
diameter of the system. The proximity of a node is set to zero if its state
becomes light. Every node in the system calculates its proximity. A node's
proximity may not exceed w_max. A system is saturated, and does not require
load balancing, if all nodes report a proximity of w_max. If the proximity of a node
changes, it must notify its near neighbors. Hence, the balancing process is initiated
by lightly loaded processors reporting a proximity of zero. If a processor's state
is heavy and any of its neighbors reports a proximity less than w_max, then it
sends a unit of its load to the neighbor of lowest proximity. The proximity map
is therefore used to perform the migration phase.
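The proximity map described above can be sketched as a multi-source breadth-first search started from all lightly loaded nodes. This is a centralized illustration under stated assumptions: in the actual scheme each node updates its own proximity from neighbor messages, and the function names and dictionary-based graph representation are hypothetical.

```python
from collections import deque

def proximity_map(neighbors, states, w_max):
    """Distance of every node to the nearest 'light' node, capped at w_max."""
    prox = {p: w_max for p in neighbors}
    queue = deque()
    for p, state in states.items():
        if state == "light":
            prox[p] = 0          # a light node has proximity zero
            queue.append(p)
    while queue:
        p = queue.popleft()
        for q in neighbors[p]:
            if prox[p] + 1 < prox[q]:
                prox[q] = prox[p] + 1
                queue.append(q)
    return prox

def migration_target(p, neighbors, states, prox, w_max):
    """Neighbor to which a heavy node sends one unit of load, or None."""
    if states[p] != "heavy":
        return None
    best = min(neighbors[p], key=lambda q: prox[q])
    return best if prox[best] < w_max else None

# Linear array 0-1-2-3 with a heavy node at one end and a light node at the other.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
states = {0: "heavy", 1: "moderate", 2: "moderate", 3: "light"}
prox = proximity_map(path, states, w_max=3)
target = migration_target(0, path, states, prox, w_max=3)
```

If no node is light, every proximity stays at w_max and `migration_target` returns `None`, which corresponds to the saturated case in the text.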
3.4.5 Hierarchical Balancing Method (HBM) 
It is an asynchronous and decentralized approach to load balancing
[LeMair and Reeves, 1993]. It organizes the multi-computer system into a
hierarchy of balancing domains, with load balancing carried out within each
domain at its own level. Specific processors are designated to control the
balancing operations at different levels of the hierarchy. The balancing process
at a given level is invoked by the receipt of load update messages indicating
an imbalance between lower-level domains, i.e. the processor in charge of the
balancing process at level l_i receives load information from the lower-level
l_{i-1} domains. Global balancing is achieved by ascending the network and balancing
the load between adjacent domains at each level in the hierarchy. The
procedure is asynchronous, however, in that balancing is invoked within a
domain whenever an imbalance is detected by the domain's designated processor.
Different imbalance thresholds can be set at different levels of the hierarchy.
The HBM scheme distributes the load balancing responsibilities to all
processors in the system. This scheme is effective in managing both
local load imbalances and excessive global imbalances.
3.4.6 Minimum Distance Scheduling (MDS) 
This is another novel dynamic scheduling scheme for load balancing,
reported in [Rafiq et al., 1999]. The algorithm operates on a minimum distance
property which assures the minimization of the communication in distributing
tasks among processors. In general, the performance of a multiprocessor
system can be characterized by communication delay, distribution of load
among the processors and scheduling overhead [Ravikanth et al., 1988],
[Reddy, 1993]. Therefore, a close correspondence between the structure of the
problem and the architecture of the processors is desired in order to minimize
these overheads. When the problem graph topology is not known a priori, the
mapping onto the processors is done on the fly. Thus, dynamic load balancing
is essential for efficient utilization of highly parallel systems when solving
non-uniform problems with unpredictable load estimates. The scheduling
techniques may have certain constraints that vary from application to
application. The MDS scheme works to minimize the communication in
distributing tasks among processors.
3.4.6.1 Minimum Distance Property. One of the important parameters for
proper utilization of multiprocessor systems is the inter-processor
communication cost, which should be as small as possible. This necessitates some
means of reducing these overheads. Therefore, to assign tasks to processors, a
scheduling strategy must be designed which takes care of minimizing the
execution and communication costs. Minimum distance is the property which
assures the minimization of the communication in distributing subtasks and
collecting partial results. One of the methods to sustain this property
is to keep message path lengths to one hop. A scheduling scheme that operates with
this property minimizes overhead and ensures the maximum possible speedup.
The property may be formally stated as: "If T and T1 are two tasks from a
task tree of a given problem such that T is the parent of T1, and if P and P1 are
the processors on which T and T1 are scheduled, then P should be directly
connected to P1 in the network." In the MDS algorithm, the adjacency matrix of
the network is used to satisfy the minimum distance property [Ravikanth et al.,
1988].
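The minimum distance property can be checked directly against an adjacency matrix, as in the following sketch. The function name and the dictionary-based task tree are hypothetical, and treating a parent and child placed on the same processor as acceptable (zero hops) is an assumption; the formal statement above only covers the directly connected case.

```python
def satisfies_minimum_distance(parent_of, placement, adjacency):
    """True if every child task sits at most one hop from its parent task.

    parent_of : maps each child task to its parent task
    placement : maps each task to the index of its processor
    adjacency : N x N matrix, 1 for a direct link and 0 otherwise
    """
    for child, parent in parent_of.items():
        p, p1 = placement[parent], placement[child]
        if p != p1 and adjacency[p][p1] != 1:
            return False
    return True

# Four processors connected as a ring.
ring = [
    [0, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 0, 1, 0],
]
tree = {"T1": "T", "T2": "T"}   # T is the parent of T1 and T2
ok = satisfies_minimum_distance(tree, {"T": 0, "T1": 1, "T2": 3}, ring)
bad = satisfies_minimum_distance(tree, {"T": 0, "T1": 1, "T2": 2}, ring)
```

In the second placement, T2 sits on processor 2, which is two hops from processor 0 in the ring, so the property is violated.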
The general model of dynamic load balancing is mainly based on
determining the profitability of load balancing at various sites in a multiprocessor
network [LeMair and Reeves, 1993]. Whenever it is profitable, a scheduler is
invoked which migrates tasks to achieve a more uniform distribution of load on
the processors. The donor (overloaded) and acceptor (underloaded) processors
are identified based on a threshold value known as the ideal load (IL). Each donor
processor, during balancing, selects the most suitable tasks (based on task
dependencies) for migration, thus maintaining minimum distance. Migration
from a donor processor is done to the directly connected acceptors. Thus, for
every donor, there is a set of Minimum Distance Acceptors (MDA). Tasks are
not allowed to migrate to acceptors which are outside this set. To perform the
load balancing, the algorithm calculates the ideal load (IL) value for each iteration,
which is used by the load balancer as a threshold to detect load imbalances and
make load migration decisions. Most load balancing algorithms consider
the overall load at a processor. However, in this algorithm the load at a
particular stage of the task structure is taken into consideration. The load
imbalance factor for the k-th stage, denoted $LIF_k$, is defined as
$$LIF_k = \frac{\max_i \{load_k(P_i)\} - (ideal\_load)_k}{(ideal\_load)_k} \qquad (3.1)$$
where
$$(ideal\_load)_k = \frac{load_k(P_0) + load_k(P_1) + \dots + load_k(P_{N-1})}{N} \qquad (3.2)$$
and $\max_i \{load_k(P_i)\}$ denotes the maximum load pertaining to stage k on a
processor $P_i$, $0 \le i \le N-1$, and $load_k(P_i)$ stands for the load on processor $P_i$ due
to the k-th stage. When implemented on the Linearly Extensible Tree (LET), the MDS
scheme shows that the network has good load balancing properties when
considering problem structures having parallelism but non-uniform growth in
various branches. The balancer uses the concept of balancing domains, which
reduces the overhead of the balancing process but does not ensure a balanced
load for the entire system. This trade-off is illustrated in the scheduling strategies
[Rafiq, 1995].
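Equations (3.1) and (3.2) translate directly into code. The sketch below uses illustrative function names and assumes the total load of the stage is non-zero, since otherwise the ideal load is zero and the factor is undefined.

```python
def ideal_load(stage_loads):
    """Equation (3.2): mean load of stage k over all N processors."""
    return sum(stage_loads) / len(stage_loads)

def load_imbalance_factor(stage_loads):
    """Equation (3.1): relative excess of the most loaded processor."""
    il = ideal_load(stage_loads)
    return (max(stage_loads) - il) / il

# Stage-k loads on four processors: ideal load 3, maximum load 6.
lif_skewed = load_imbalance_factor([6, 2, 2, 2])
lif_even = load_imbalance_factor([3, 3, 3, 3])
```

For the skewed distribution the ideal load is 3 and the factor is (6 - 3)/3 = 1, while a perfectly balanced stage gives a factor of 0.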
From the above review, it is apparent that a myriad of multiprocessor
scheduling strategies exist which can be applied to specific program
structures and specific system architectures. An optimal schedule can be
produced based on some objective function to enhance the performance of the overall
system.
In general, the following remarks can be highlighted.
• Static approaches are easier to implement and have minimal runtime
overhead. However, these schemes are not applicable to parallel
systems where computing resources and communication network/traffic
are not known in advance.
• Dynamic approaches result in better performance but at the cost of high
overhead.
• Decentralized schemes are costlier than centralized ones because it is very
difficult for individual nodes to obtain and maintain the dynamic state
information of the whole system.
• The total number of iterations that the SID and RID policies require to
achieve global balancing is application and topology dependent.
• The RID strategy, on the other hand, may be implemented easily on simpler
topologies and can scale elegantly to large systems.
• The efficiency of the DEM and HBM strategies depends heavily on the
system interconnection topology.
Motivated by the linearly extensible properties of LET and its
performance when the MDS scheduling scheme is applied to it, a new
cube-like interconnection topology named Linearly Extensible Cube (LEC) has
been proposed and analyzed. The proposed architecture and its topological
properties are discussed in Chapter 4. A new dynamic scheduling
scheme has been devised and implemented to evaluate the performance of the
proposed LEC. The simulation results are discussed in Chapter 5.
CHAPTER 4 
Linearly Extensible Cube Network 
The rationale for using multiprocessors is to create powerful computers
by simply connecting multiple processors (nodes). The demand for higher
computation speed and the signs of saturation in integrated circuit technology
have given a fillip to the development of multiprocessor systems. The
multiprocessor approach to parallelism is the most generalized and flexible
one, but to a great extent its success depends on the interconnection topology. In
this approach, multiple nodes work in parallel on a given program
to reduce the total execution time. The processors in such a system are
attached to an interconnection network. There are a large number of choices for
interconnection networks, such as crossbar, butterfly, mesh, ring, multistage
ring, torus ring, tree, hypercube, hypernet, completely connected, de Bruijn,
LET and so on [Ganeshan and Pradhan, 1993], [Hamid and Hall, 1994], [Abdel
and Khaled, 1998], [Rafiq et al., 1999], [Yoo et al., 2000], [Kwai and Parhami,
2004]. There has been a lot of research on designing an appropriate topology
of interconnection networks for massively parallel computer systems.
However, there is no consensus on the best network organization [Kim and
Veidenbaum, 1999]. Many large-scale multiprocessor systems have been
developed with their own topologies [Kwak and Jhon, 2007]. Nevertheless, the
basic goal is that a particular topology should have excellent properties such as
regularity, scalability, small diameter, high connectivity, high fault tolerance,
low degree and small link complexity. Therefore, the choice of the topology of
the interconnection network is critical in the design of a traditional parallel
system and it may affect the overall system performance.
The study of parallel computer interconnection topologies has emphasized the
wide use of cube-based topologies. A variety of cube-based multiprocessor
networks have been reported which exhibit excellent properties
such as regularity, symmetry, small diameter, low
degree and good scalability [Esfahanian and Sagan, 1993], [Ghose and Desai,
1995], [Zhang, 2002]. Many interconnection networks such as trees and
multidimensional meshes can be embedded in the cube. Motivated by the
above discussion, a new multiprocessor network named the Linearly Extensible
Cube (LEC) network has been proposed which exhibits the desirable properties
of topologically similar networks.
In this chapter, an analysis of the said network and its various properties
is presented, and a brief comparison with other existing networks such as hypercube, de
Bruijn and Linearly Extensible Tree (LET) is given in tabular form.
4.1 Multiprocessor Interconnection Networks 
A suitable interconnection network is an integral part of any massively
parallel system. The network is often modeled as an undirected or directed graph.
The nodes (vertices) of such a graph represent the processing elements and the
edges (arcs) denote the bidirectional communication channels/links. The length
of a path between two nodes is the number of edges encountered in the path.
The diameter of a network is the largest distance between two nodes. The
degree of a network is the largest degree of all nodes in the network.
Extensibility is the property which facilitates constructing large-sized systems
out of small-sized systems with minimum changes in the configuration of a
node of the system. Interconnection topologies are evaluated in terms of small
diameter, low degree, simple extensibility and high fault tolerance. The
bisection width is another parameter used to assess the performance of the network.
It is the minimum number of edges that must be removed when a given
network is cut into two halves. A high bisection width is desirable in an
interconnection network. Many of these parameters make contradictory
demands and therefore the design of a network involves a compromise. As
the proposed network is a modification of the hypercube network, a brief
description of the hypercube network is given first in the next section.
4.2 Hypercube Network 
The hypercube represents a class of message-passing architectures using
the cube (or exchange) interconnection topology. Hypercube networks are some of
the first and most successful commercial multiprocessors. Each node in this
network is connected through bidirectional, asynchronous point-to-point
communication links to other nodes.
An n-dimensional hypercube multiprocessor consists of $N = 2^n$
processors. Each processor is labeled by a different n-bit binary number
($b_{n-1} b_{n-2} \dots b_1 b_0$). Two processors are connected directly by a link if, and only if,
their binary labels differ in exactly one bit position. The connection scheme
places the processors at the vertices of an n-dimensional cube. Hypercube
interconnection networks for different dimensions from 1 to 4 are shown in
Figure 4.1. The hypercube has the property that it can be defined inductively. A
hypercube of order 0 is a single node, and the hypercube of order n+1 is
constructed by taking two hypercubes of order n and connecting their respective
nodes. Figure 4.1 shows the hypercube architectures for n = 1, 2, 3, and 4. A
one-cube architecture has n = 1 and $2^n = 2$ nodes interconnected by a single
path. A two-cube architecture has n = 2 and $2^n = 4$ nodes interconnected as a
square. A three-cube architecture has eight nodes interconnected as a cube, and
so on.
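The connection rule (labels differing in exactly one bit) can be expressed compactly by flipping each bit of a label in turn. The sketch below is illustrative; the function name is hypothetical and labels are stored as integers.

```python
def hypercube_neighbors(n):
    """Map every n-bit label to the labels that differ from it in one bit."""
    return {p: [p ^ (1 << b) for b in range(n)] for p in range(1 << n)}

cube = hypercube_neighbors(3)
# Node 000 is linked to 001, 010 and 100.
neighbors_of_000 = [format(q, "03b") for q in cube[0]]
```

Every node has exactly n neighbors, one per dimension, which matches the node degree of the hypercube.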
Figure 4.1: Hypercube interconnections (panels: n = 1, n = 2, n = 3, n = 4)
The reason for the popularity of the hypercube network can be attributed
to its topological properties. Some important properties of this interconnection
network as used in parallel computers are given below [Saad and Shultz, 1988], [Ganeshan
and Pradhan, 1993], [Mano, 2003], [Hesham and Mustfa, 2005]:
1. The diameter of a hypercube with $2^n$ nodes is n, i.e. the logarithm of the
number of nodes in the network. The node degree (number of edges per
node) of the hypercube is also equal to n. The diameter and the node degree
increase as the number of nodes in the hypercube network increases. The
hypercube has a high bisection width $b = 2^{n-1}$ and has good
fault tolerance.
2. A hypercube is a superset of other interconnection networks such as
rings, multistage cube networks, meshes, trees etc., because these can be
embedded into a hypercube by ignoring some hypercube connections.
3. Hypercubes have simple routing schemes. A message-routing policy
may send a message to the neighbor whose binary tag agrees with the
tag of the final destination in the next bit position, with the bits scanned
in some order. The path length for sending a message between any two
nodes is exactly the number of bits in which their tags differ.
Numerous possible paths connecting any two nodes exist in the network,
which produces a large communication bandwidth. For example,
referring to Figure 4.1, in a three-cube structure (n = 3), node 000 can
communicate directly with 001. It must cross at least two links to
communicate with 011 (from 000 to 001 to 011 or from 000 to 010 to
011). Similarly, it is necessary to go through at least three links to
communicate from node 000 to node 111 [Hwang and Briggs, 1985],
[Bhuyan and Agrawal, 1984], [Mano, 2003].
4. The hypercube has poor scalability and it is difficult to package
higher-dimensional hypercubes. In other words, to add even a few nodes
to the network, the network size must be duplicated to reach the next
specified network size [Ghose and Desai, 1995].
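The routing policy of property 3 can be sketched as follows, assuming the bits are scanned from the least significant position upward (the text leaves the scan order open). The function name is illustrative.

```python
def hypercube_route(src, dst):
    """Path from src to dst obtained by correcting one differing bit per hop."""
    path = [src]
    node = src
    diff = src ^ dst      # set bits mark the positions where the tags differ
    bit = 0
    while diff:
        if diff & 1:
            node ^= 1 << bit          # move to the neighbor across this dimension
            path.append(node)
        diff >>= 1
        bit += 1
    return path

route = [format(p, "03b") for p in hypercube_route(0b000, 0b011)]
hops_to_111 = len(hypercube_route(0b000, 0b111)) - 1
```

The first call reproduces the path 000 to 001 to 011 from the example above, and the number of hops always equals the number of bit positions in which the two tags differ.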
4.3 Linearly Extensible Cube (LEC) Multiprocessor Network 
4.3.1 Design and Analysis 
The LEC network grows linearly in a cube-like shape. The network itself
is recursively connected and is defined through connection functions in a
manner similar to that of the cube connection. Let Q be a set of N identical
processors, represented as
$$Q = \{P_0, P_1, P_2, \dots, P_{N-1}\}$$
The number of processors N in the network is given by
$$N = 2n \qquad (4.1)$$
where n is the level or depth of the network ($n \in \mathbb{Z}$ and $n > 0$). At every
level, the network has an even number of processors. For n = 1, an LEC
architecture of two processors interconnected by a single path is obtained.
For the second level (n = 2), the network has 2n = 4 interconnected
processors. Similarly, for n = 3, the LEC network has 2n = 6 interconnected
processors.
In order to define the link functions we denote each processor in the set
Q as $P_{i,n}$, n being the level at which the processor $P_i$ resides. As per the
extension policy, only two processors exist at level n. Thus at level 1, $P_0$ and $P_1$
exist, at level 2, $P_2$ and $P_3$ exist, and so on. The arrangement is shown in
Figure 4.2.
$P_{0,1}$   $P_{1,1}$
$P_{2,2}$   $P_{3,2}$
$P_{4,3}$   $P_{5,3}$
Figure 4.2: Arrangement of processors in LEC network
Let Q' be the set of designated processors of Q, thus
$$Q' = \{P_i\}, \quad 0 \le i \le N-1$$
The link functions $L_1$ and $L_2$ define the mapping from Q' to Q as
$$L_1(P_i) = P_{(i+1) \bmod N}$$
$$L_2(P_i) = P_{(i+2) \bmod N}, \quad \text{for all } P_i \text{ in } Q' \qquad (4.2)$$
The two functions $L_1$ and $L_2$ in Equation (4.2) indicate the links between
various processors in the network. These link functions can also be
represented by an adjacency matrix of order N x N, where N is the number
of processors. Figure 4.3 (a) and Figure 4.3 (b) show the proposed network for
six processors and its adjacency matrix respectively, where '1' indicates a
connection and '0' indicates no connection between nodes.
Figure 4.3 (a): The LEC architecture with six processors
      P0  P1  P2  P3  P4  P5
P0     0   1   1   0   1   1
P1     1   0   1   1   0   1
P2     1   1   0   1   1   0
P3     0   1   1   0   1   1
P4     1   0   1   1   0   1
P5     1   1   0   1   1   0
Figure 4.3 (b): Adjacency matrix for Figure 4.3 (a)
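The adjacency matrix of Figure 4.3 (b) can be generated from the link functions of Equation (4.2). The sketch below uses an illustrative function name and assumes, as the symmetric matrix indicates, that each link is bidirectional.

```python
def lec_adjacency(n_levels):
    """Adjacency matrix of an LEC network with N = 2 * n_levels processors."""
    n = 2 * n_levels
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        for step in (1, 2):            # link functions L1 and L2
            j = (i + step) % n
            if i != j:                 # no self-links in very small networks
                adj[i][j] = adj[j][i] = 1
    return adj

for row in lec_adjacency(3):           # six-processor network of Figure 4.3
    print(" ".join(str(x) for x in row))
```

For n = 3 the printed rows coincide with the matrix of Figure 4.3 (b), and for n = 1 the result is two processors joined by a single link.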
4.3.2 Properties of the LEC Network 
Here some properties of the LEC network are compared with those of the
hypercube, de Bruijn [Samathan and Pradhan, 1989], [Quinn, 2002] and LET
[Rafiq et al., 1999] networks. These properties help in understanding the
effectiveness of a particular organization. The various properties are:
• Number of Nodes (N): The number of nodes in a multiprocessor network
plays an important role in evaluating the performance of a multiprocessor
system. The smaller the number of nodes, the lower the system complexity and
the more economical the system. Therefore, the number of nodes should be optimal. The
number of nodes in the LEC network is N = 2n for n > 0, whereas the number
of nodes in the hypercube and de Bruijn networks is $2^n$. In the LET network,
$N = \sum_{k=1}^{n} k$, where n is the depth or level of the network. Because of the smaller number
of processors in the LEC network, it may be considered more economical
than the other networks.
• Diameter (D): The diameter of a network is the measure of the maximum
inter-node distance in the network. This property is important in
determining the distance involved in communication and hence the
performance of multiprocessor systems. In simple words, the diameter of a
network is the longest of the shortest paths between any source and destination node.
The diameter of a network is bound to increase as the size grows unless
there is no limit on the number of links.
In the case of de Bruijn, hypercube and LET, the diameter increases by
one as the number of processors is doubled. Ignoring the fold-back connections,
the diameter of the LEC network also increases by one on each extension.
Table 4.1 shows the diameter of various multiprocessor networks for different
levels. These results have been obtained using a shortest path algorithm. In the case
of the LEC network, it is observed that the diameter does not always increase with
the addition of a layer of processors. It may be highlighted that the diameter of
LEC reaches a maximum value of 10 for 40 processors, which is smaller than
that of the other networks given in Table 4.1.
The diameter of the hypercube and de Bruijn networks increases linearly
with the number of processors. In the case of LEC and LET, the
increase in diameter is not linear; however, LEC has a smaller diameter
than LET. This trend is depicted in Figure 4.4.
Figure 4.4: Comparison of Diameter of different multiprocessor networks (plot titled "Diameter of Various Size Networks"; horizontal axis: Level Number; series: LEC, LET, Hypercube, de Bruijn)
Table 4.1: Diameter of various multiprocessor networks for different levels
• Degree (d): The degree or connectivity of a node in a network is defined as
the number of connections required at each node. The connectivity of the
nodes determines the complexity of the network. The higher the
connectivity, the higher the hardware complexity and hence the cost of the
network. Therefore, the node degree should be kept as low as possible in
order to reduce cost. It is best if the number of edges per node is a constant
independent of the network size, because in that case the processor
organization scales more easily to systems with a large number of nodes. The
degree of a node in the proposed network is always 4. The connectivity of
the LET and de Bruijn networks is also 4 or less, whereas the connectivity
of the hypercube increases with the size.
• Extensibility: It is the property which facilitates building large-sized systems out of
small ones with minimum changes in the configuration of the nodes. It is
the smallest increment by which the system can be expanded in a useful
way. In the proposed network, the number of processors increases by a
constant amount because each extension requires a single layer of 2 nodes
and no additional node is required at any extension. In the case of
LET, the extension complexity increases linearly because
each extension requires adding a single layer of (n+1) nodes. Therefore, at
higher levels, the number of nodes becomes large and complexity may
increase. Similarly, the hypercube and de Bruijn networks are
extensible, but their complexity increases exponentially, by powers of 2.
The constant growth of the proposed network makes the extension less
costly. Besides, the LEC network can be extended in two directions,
vertically upward and vertically downward, so that a chain of networks can
be formed, which is not possible in the case of the other networks. Figure 4.5
shows the extensibility of the LEC network in both directions.
Figure 4.5: Extensibility of LEC network (vertically upward and vertically downward)
• Bisection Width (b): The bisection width of a network is the minimum
number of edges that must be removed in order to divide the network into
two halves (equal in size to within one node). A high bisection width is better, because in
algorithms requiring large amounts of data movement, the size of the data
set divided by the bisection width puts a lower bound on the complexity of
the parallel algorithm. The hypercube and de Bruijn networks have a high
bisection width equal to $2^{n-1}$ and $2^n/n$ respectively. The LET network has a
bisection width equal to $2\log_2(n-2)$, whereas the LEC has a bisection width
equal to N.
• Fault Tolerance: As more and more processors are incorporated into
parallel machines, the size and the complexity of the network increase. If a
fault occurs in such a complex system, all the active
processors should remain a connected component and be able to
take part in significant parallel computation. In the proposed network,
the bisection width is directly proportional to the number of processors
available in the network. Therefore, the bisection width also increases with
the network size. Thus, the proposed LEC network has a
better capability of fault tolerance.
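The degree and diameter figures quoted for the LEC network can be checked with a breadth-first search, which is one possible shortest path algorithm; the thesis does not state which one was used. The sketch below is an illustration under that assumption, with hypothetical function names, and it includes the fold-back (modulo N) links of Equation (4.2) as bidirectional links.

```python
from collections import deque

def lec_neighbors(n_procs):
    """Neighbor sets obtained from link functions L1 and L2 (bidirectional)."""
    nbrs = {i: set() for i in range(n_procs)}
    for i in range(n_procs):
        for step in (1, 2):
            j = (i + step) % n_procs
            if i != j:
                nbrs[i].add(j)
                nbrs[j].add(i)
    return nbrs

def eccentricity(nbrs, src):
    """Largest shortest-path distance from src, found by breadth-first search."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in nbrs[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return max(dist.values())

def diameter(nbrs):
    return max(eccentricity(nbrs, s) for s in nbrs)

def max_degree(nbrs):
    return max(len(v) for v in nbrs.values())

net = lec_neighbors(40)                # 20 levels, 40 processors
print(diameter(net), max_degree(net))
```

For 40 processors this gives a diameter of 10 and a node degree of 4, in agreement with the values stated in the text.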
Table 4.2 summarizes the comparison of the above characteristics of
various processor interconnection networks, which shows the superiority of the
proposed network.
Table 4.2: Summary of parameters for various multiprocessor networks
Parameter           Hypercube        de Bruijn        LET                      LEC
No. of processors   $N = 2^n$        $N = 2^n$        $N = \sum_{k=1}^{n} k$   $N = 2n$
Degree              n                4                4                        4
Extensibility       $2^n$            $2^n$            n+1                      2
Diameter            $O(\log_2 n)$    $O(\log_2 n)$    $O(\sqrt{N})$            $\lfloor N \rfloor$
Bisection width     $2^{n-1}$        $2^n/n$          $2\log_2(n-2)$           N
In conclusion, a new network topology (LEC) for
multiprocessor systems has been proposed as an attempt to combine some
desirable features of linearly extensible structures and compact hypercube or de
Bruijn structures. Compared with the hypercube and de Bruijn networks, the proposed
architecture exhibits better connectivity, fewer nodes, a smaller diameter and a
constant extension of two nodes at each level. Therefore, the proposed LEC
architecture may be considered a low-cost multiprocessor architecture.
In the next chapter, a dynamic scheduling scheme is described which 
takes into account the adjacency matrix of network interconnections for 
migration of load on the various processors of the network. To evaluate the 
performance, the proposed scheme is implemented on LEC by simulating 
different types of load. The performance of LEC network is compared by 
implementing the proposed scheme on other multiprocessor architectures 
discussed above. 
CHAPTER 5 
Performance Measure Strategies 
The performance of a multiprocessor network depends not only on
its interconnection pattern but also on how the load is distributed over the different
nodes. Scheduling of load (tasks) and resource management are therefore
important issues when optimizing the performance of a multiprocessor
network. The performance of a multiprocessor system depends upon the
effective utilization of processing elements (nodes). In such systems, if some
nodes remain idle while others are extremely busy, system performance will be
degraded. Therefore, to make efficient use of a multiprocessor network it is
necessary to distribute load in a manner which keeps all the processors equally
busy. The assignment of tasks among the various processors in the network, together with
optimal utilization of the resources, is commonly referred to as scheduling. The
main objective of a scheduling scheme is to speed up the execution of
applications, especially when the workload varies at run time in an
unpredictable way.
Mapping applications to parallel machines and balancing load on parallel
processors is indeed a very difficult task and has been addressed as a research
problem. Many load balancing algorithms have been proposed and
implemented on massively parallel computers [LeMair and Reeves, 1990],
[Kumar et al., 1991], [Ishfaq and Ghafoor, 1991], [LeMair and Reeves, 1993],
[Hortron, 1993], [Cortes et al., 1999], [Hsu et al., 2000], [Salleh et al., 2002],
[Doulamis et al., 2007]. These algorithms are classified in a number of ways
such as static/dynamic, global/local, sender/receiver initiated and/or
synchronous/asynchronous. These classifications have already been discussed
in Chapter 3. The important point is that in a multiprocessor system where
the status of the load changes frequently and in an unpredictable way, a dynamic
approach gives better performance [Zaki et al., 1997]. A dynamic algorithm
allocates/reallocates load at runtime based on no a priori task information, and
may determine when and which tasks can be migrated. A novel dynamic
scheduling scheme, namely Minimum Distance Scheduling (MDS), for load
balancing has been reported [Rafiq et al., 1999]. The algorithm utilizes a
minimum distance property.
In this chapter, the MDS algorithm is first described in brief. An
extension of the MDS scheme is then discussed and, based on MDS, a modified
dynamic scheme named Two Round Scheduling (TRS) is proposed. The
TRS scheme has been implemented on the proposed LEC as well as on some
other standard reported topologies. In addition, other reported
dynamic scheduling schemes, namely the Hierarchical Balancing Method (HBM)
and the Gradient Model (GM) [LeMair and Reeves, 1993], have also been
implemented on the LEC network.
5.1 Minimum Distance Scheduling (MDS) Scheme
The performance of a multiprocessor system can be characterized by
communication delay, distribution of load among the processors and
scheduling overhead. The MDS scheme is based on the principle of minimum
distance. Minimum distance is the property which assures that the
communication needed for distributing subtasks and collecting partial
results is minimized. A scheduling scheme that operates with this property
minimizes overhead and ensures the maximum possible speedup.
In the MDS scheme, the adjacency matrix of the network is used to
satisfy the minimum distance property. A 'one' in the matrix indicates a link
between two nodes whereas a 'zero' indicates that there is no link between
them. The LEC network and its adjacency matrix are shown in Figure 5.1.
For load balancing, the MDS algorithm determines the value of the Ideal Load
(IL) at various stages of the load (task generation). IL is calculated by
summing the load of each node in the network and dividing by the total number
of nodes available in the network. Processors having a load greater than IL
are considered overloaded processors. Similarly, processors having a load
smaller than IL are termed underloaded processors. In other words, the
overloaded (donor) and underloaded (acceptor) processors are identified
using IL as a threshold value. During balancing, each donor processor
selects tasks for migration to the connected underloaded processors (i.e.
the processors having a 'one' in the adjacency matrix), thus maintaining
minimum distance. Most load balancing algorithms consider the overall load
on the network. In this algorithm, however, the load is mapped through the
various stages of the task structure. Each stage represents a particular
state of the task structure, which consists of a finite number of tasks. The
implementation of MDS on LET shows that the network has good load
balancing capabilities [Rafiq et al., 1999].
Figure 5.1 (a): The LEC network
      P0  P1  P2  P3  P4  P5
P0     0   1   1   0   1   1
P1     1   0   1   1   0   1
P2     1   1   0   1   1   0
P3     0   1   1   0   1   1
P4     1   0   1   1   0   1
P5     1   1   0   1   1   0
Figure 5.1 (b): Adjacency matrix for LEC
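The MDS selection rule described above can be sketched in C as follows. This is only an illustrative sketch: the function names `ideal_load` and `mds_acceptors` are hypothetical and do not come from the original implementation, and the matrix is the one of Figure 5.1 (b).

```c
#include <assert.h>

#define PMAX 6  /* number of processors in the LEC network of Figure 5.1 */

/* Adjacency matrix of Figure 5.1 (b): adj[i][j] == 1 means Pi and Pj share a link. */
static const int adj[PMAX][PMAX] = {
    {0, 1, 1, 0, 1, 1},
    {1, 0, 1, 1, 0, 1},
    {1, 1, 0, 1, 1, 0},
    {0, 1, 1, 0, 1, 1},
    {1, 0, 1, 1, 0, 1},
    {1, 1, 0, 1, 1, 0},
};

/* Ideal load (IL): total load divided by the number of processors. */
double ideal_load(const int load[PMAX])
{
    int sum = 0;
    for (int i = 0; i < PMAX; ++i)
        sum += load[i];
    return (double)sum / PMAX;
}

/* Hypothetical helper: fills mda[] with the underloaded processors that are
   directly linked to `donor` and returns how many were found. */
int mds_acceptors(const int load[PMAX], int donor, int mda[PMAX])
{
    assert(donor >= 0 && donor < PMAX);
    double il = ideal_load(load);
    int count = 0;
    for (int j = 0; j < PMAX; ++j)
        if (adj[donor][j] == 1 && load[j] < il)
            mda[count++] = j;
    return count;
}
```

For a load of {4, 4, 4, 0, 4, 0}, IL is 16/6, so P3 and P5 are both underloaded, yet the donor P0 can reach only P5 because P0 and P3 share no link.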
5.2 Two Round Scheduling (TRS) Scheme 
The basic approach in MDS is to optimize the load balancing among
processors under the constraint that message path lengths are kept to one
hop, thus satisfying the minimum distance property. Migration from a donor
processor is done to the directly connected acceptors only. Thus, for every
donor, there is a set of Minimum Distance Acceptors (MDA). Tasks are not
allowed to migrate to acceptors which are outside this set. Referring to
Figure 5.1 for the LEC network, MDA(P0) = {P1, P2, P4, P5}, which indicates
that even if the processor P3 is underloaded, it would not be considered as
a part of the balancing process for P0. Therefore, a more dynamic algorithm
is required to make the network fully balanced, one which also takes into
consideration those processors which are not directly connected.
A new scheme is proposed for solving the load balancing problem
with unpredictable load estimates. The proposed algorithm works as an
extension of MDS and is named the Two Round Scheduling (TRS) scheme. It is
dynamic in the sense that no a priori knowledge of the load is assumed. The
TRS scheme also takes into consideration those acceptor nodes which are not
connected directly to the donor node. There may be more than one path between
the donor and acceptor processors, each requiring multiple hops. Allowing a
larger number of hops reduces the load imbalance and hence gives a smaller
LIF (i.e. less than the standard range of 40%). The proposed TRS algorithm,
however, is constrained to consider only one processor as an intermediate
node between the donor and acceptor nodes. To perform the load balancing, the
algorithm calculates the ideal load value for each stage of the task
structure, which is used as a threshold to detect load imbalances and make
load migration decisions. The load imbalance factor for the k-th stage,
denoted $\mathrm{LIF}_k$, is defined as given in Equation (3.1) of Chapter 3,
which is:

$$\mathrm{LIF}_k = \frac{\max_{0 \le i \le N-1} \{\mathrm{load}_k(P_i)\} - (\mathrm{ideal\_load})_k}{(\mathrm{ideal\_load})_k}$$

where

$$(\mathrm{ideal\_load})_k = \frac{\mathrm{load}_k(P_0) + \mathrm{load}_k(P_1) + \dots + \mathrm{load}_k(P_{N-1})}{N},$$

$\max \{\mathrm{load}_k(P_i)\}$ denotes the maximum load pertaining to stage k
on a processor $P_i$, $0 \le i \le N-1$, and $\mathrm{load}_k(P_i)$ stands for
the load on processor $P_i$ due to the k-th stage. Each stage of the task
structure (load) represents a finite number of tasks. Based on the IL value,
the donor (overloaded) processors and acceptor (underloaded) processors are
identified. Migration of tasks can take place between donor and acceptor
processors only.
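As a worked instance of the definition above, take the balanced loads 3, 3, 3, 2, 3, 2 on N = 6 processors, which are the values reported for stage 5 of the uniform load in Table 5.2:

```latex
\[
(\mathrm{ideal\_load})_5 = \frac{3+3+3+2+3+2}{6} = \frac{16}{6} \approx 2.667,
\qquad
\mathrm{LIF}_5 = \frac{3 - 16/6}{16/6} = \frac{1}{8} = 0.125 \;\; (12.5\%)
\]
```

This agrees, up to rounding, with the LIF of about 12.5% listed for that stage.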
The scheme may be defined in the following five steps:
i) Map the tasks at the root processor and calculate IL at a particular
stage of the task structure (load).
ii) Transfer the load onto the various available processors in the network.
Check the load of each processor to identify the donor and acceptor
processors.
iii) Check the connectivity of donors and acceptors with the help of the
adjacency matrix and migrate tasks from donors to acceptors to make
the connected processors balanced.
iv) If no direct connection is available between donor and acceptor, then
find an alternative path by considering only one intermediate node,
and perform migrations to make the network fully balanced.
v) Repeat the above procedure for the next stage.
The whole algorithm is implemented in the C language. A pseudo code of
the algorithm is shown in Table 5.1.
TABLE 5.1: The TRS Algorithm
trs()
{
    /* Generate a task at the 0th processor; tgs[] indicates the task
       generation at a particular stage */
    /* LMAX is the maximum load on a processor at a particular load stage */
    tgs[0] = 1;
    it_count1 = 0;
    while (it_count1 < LMAX)
    {
        /* Calculate IL and RIL (rounded IL) */
        IL = CalculateIL(tgs);
        RIL = ceil(IL);
        print(tgs);                 /* TGS row: load as generated */
        /* For all processors, check whether the load on a particular
           processor exceeds RIL. If so, migrate the load. */
        /* PMAX is the total number of processors */
        for (it_count2 = 0; it_count2 < PMAX; ++it_count2)
        {
            if (tgs[it_count2] > RIL)
            {
                /* Migrate until the load at the processor becomes equal to
                   or less than RIL, or no further task can be moved */
                while (true)
                {
                    moved = migrate(it_count2);
                    if (tgs[it_count2] <= RIL || moved == 0) break;
                }
            }
        }
        print(tgs);                 /* TRS row: load after balancing */
        /* Calculate LIF */
        LIF = (max(tgs) - IL) / IL;
        /* Enter the next level of task generation
           (ts indicates the task structure) */
        tgs = ts * tgs;
        it_count1++;
    }
}

/* Functions used by the algorithm */
CalculateIL(x[])
{
    sum = 0;                        /* x[i] indicates the load at the ith processor */
    for (i = 0; i < PMAX; ++i)
        sum = sum + x[i];
    return (sum / PMAX);
}

/* Perform migrations; returns the number of tasks moved */
migrate(pnumber)
{
    moved = 0;
    for (level = 1; level <= 2 && moved == 0; ++level)
    {
        /* Get the set of processors connected, at this level, to the
           processor for which migration is being called, i.e. pnumber */
        k = 0;
        for (i = 0; i < PMAX; ++i)
            if (i != pnumber && connected(i, pnumber, level))
                temp[k++] = i;
        if (k == 0) continue;
        /* Get the least loaded processor number */
        small = temp[0];
        for (i = 1; i < k; ++i)
            if (tgs[temp[i]] < tgs[small])
                small = temp[i];
        /* Transfer the load from pnumber to the least loaded connected
           processor */
        while (tgs[pnumber] > RIL && tgs[small] < RIL)
        {
            tgs[pnumber]--;
            tgs[small] += 1;
            moved++;
        }
        /* If nothing could be moved, check the underloaded processors which
           are not directly connected by repeating the above procedure for
           the next level of connectivity */
    }
    return moved;
}

/* Function used to find the maximum load on a processor */
max(x[])
{
    m = x[0];
    for (i = 1; i < PMAX; ++i)
        if (x[i] > m) m = x[i];
    return (m);
}

/* Function to check the connectivity of processor i with processor j.
   Assume the level of connectivity is given (1 or 2) */
int connected(int i, int j, int level)  /* returns 1 if processors i, j are connected */
{
    /* printf("\n node %d is connected to %d: %d", i, j, adj[i][j]); */
    if (level == 1)
        return adj[i][j];
    for (int k = 0; k < PMAX; k++)
    {
        if (k == i || k == j) continue;
        if (connected(i, k, 1) && connected(k, j, 1))
        {
            /* printf("\n node %d is connected to %d through %d", i, j, k); */
            return 1;
        }
    }
    return 0;
}
end of procedure
The above algorithm has been implemented on LEC and other
multiprocessor networks for different types of load. The simulation results are
discussed in the next section.
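A compact, self-contained C sketch of one TRS balancing pass on the six-node LEC network is given here for illustration. It is not the original implementation: the function names, the integer computation of the rounded ideal load, the tie-breaking by lowest processor index and the guard that stops when no acceptor is reachable within two hops are all assumptions made for this sketch.

```c
#include <assert.h>

#define PMAX 6  /* processors in the six-node LEC network */

/* Adjacency matrix of the LEC network as given in Figure 5.1 (b). */
static const int adj[PMAX][PMAX] = {
    {0, 1, 1, 0, 1, 1},
    {1, 0, 1, 1, 0, 1},
    {1, 1, 0, 1, 1, 0},
    {0, 1, 1, 0, 1, 1},
    {1, 0, 1, 1, 0, 1},
    {1, 1, 0, 1, 1, 0},
};

/* Returns 1 if i and j are linked directly (level 1) or through one
   intermediate processor (level 2). */
static int connected(int i, int j, int level)
{
    if (level == 1)
        return adj[i][j];
    for (int k = 0; k < PMAX; ++k) {
        if (k == i || k == j)
            continue;
        if (adj[i][k] && adj[k][j])
            return 1;
    }
    return 0;
}

/* Moves tasks from `donor` to its least loaded acceptor at the given level
   until either side reaches ril. Returns the number of tasks moved. */
static int migrate(int tgs[PMAX], int donor, int level, int ril)
{
    int small = -1;
    for (int i = 0; i < PMAX; ++i) {
        if (i == donor || !connected(donor, i, level))
            continue;
        if (small < 0 || tgs[i] < tgs[small])
            small = i;
    }
    if (small < 0)
        return 0;
    int moved = 0;
    while (tgs[donor] > ril && tgs[small] < ril) {
        tgs[donor]--;
        tgs[small]++;
        moved++;
    }
    return moved;
}

/* One TRS balancing pass over a stage; returns the LIF in percent. */
double trs_balance_stage(int tgs[PMAX])
{
    int sum = 0;
    for (int i = 0; i < PMAX; ++i)
        sum += tgs[i];
    assert(sum > 0);
    double il = (double)sum / PMAX;
    int ril = (sum + PMAX - 1) / PMAX;      /* rounded-up ideal load (RIL) */

    for (int p = 0; p < PMAX; ++p) {
        while (tgs[p] > ril) {
            int moved = migrate(tgs, p, 1, ril);   /* round 1: direct neighbors */
            if (moved == 0)
                moved = migrate(tgs, p, 2, ril);   /* round 2: one intermediate node */
            if (moved == 0)
                break;                             /* no acceptor within two hops */
        }
    }

    int max = tgs[0];
    for (int i = 1; i < PMAX; ++i)
        if (tgs[i] > max)
            max = tgs[i];
    return 100.0 * (max - il) / il;
}
```

Calling `trs_balance_stage` on the load {4, 4, 4, 0, 4, 0} leaves {3, 3, 3, 2, 3, 2}. On {4, 2, 2, 0, 2, 2} the first round moves nothing, because every direct neighbor of P0 is already at the rounded ideal load, and the second round moves two tasks from P0 to P3 through an intermediate node.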
5.3 Simulation Results 
The above mentioned TRS scheme has been implemented on an IBM
server X series 226 with an Intel Xeon 3.0 GHz processor, using the same
environment for all experiments. A simulation run consists of generating
tasks and executing them on the network of processors, i.e. on the six
processor LEC network, under the proposed Two Round Scheduling scheme. The
results are computed for various types of load generation. In particular,
the performance is evaluated for uniform load as well as for non-uniform
load (or random load) generation. To evaluate the performance, the average
percentage of LIF is computed, which indicates the load imbalance left after
a balancing action at each stage of the task structure.
5.3.1 Dynamic Load Model 
The key advantage of massively parallel systems is that they allow
concurrent execution of a workload characterized by computation units known
as processes or tasks, which can be independent programs or partitioned
modules of a single program. A task is a complete sequence of instructions.
Task execution starts when a task is selected by the task scheduler and one
of the system's nodes starts to run the task's instructions. Tasks may be
classified according to their deadline, priority and arrival
characteristics. For the purpose of simulation we assume a simple problem
characterization in which the load is partitioned into a number of tasks.
All tasks are independent and may be executed on any processor in any
sequence.
The scheduling performance of the strategy has been tested on the LEC
network by simulating artificial dynamic load. In order to simulate the load on
the proposed network, it is characterized into two groups of task structures,
i.e. uniform and non-uniform load. For a meaningful simulation, tree
structures that form a representative sample of the programs to be executed
on the network are needed. The tree is considered as a test problem on which
the schemes are to be applied. In the case of uniform load, tasks are
generated in a deterministic manner in the form of a regular tree. Each node
of the tree represents a task, and the tasks are executed in parallel in
breadth-first manner starting from the root task, which is assigned to a
given node of the network. The total number of nodes in the task tree at a
level represents a particular stage of the load.
In order to characterize non-uniform load (non-deterministic load), the
total problem is conceived to be an arbitrary tree which unwinds itself level
by level. A task scheduled on a processor spawns an arbitrary or random number
of subtasks, which are part of the whole problem tree. Thus the load on each
processor varies at run time, creating imbalance, and the balancer/scheduler
has to be invoked after each stage. The generation of a randomly created tree
is governed by two random variables, 'Spawn' and 'Fanout'. The random variable
'Spawn' decides whether a node should be a leaf node or an internal node, and
the random variable 'Fanout' decides the number of children a node should
have. A tree is built in a breadth-first manner, starting from the root node.
By repeated application of the following operations on the nodes at the
lowest level of the partially constructed tree, a tree structure is generated up to
a pre-specified level (stage):
If the stage of the node is equal to the pre-specified stage
    then the node is a leaf node
else if the value assumed by the random variable Spawn is zero
    then the node is a leaf node
else the node is an internal one and the number of children it has is equal
    to the value assumed by the random variable Fanout for that node.
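The generation rule above can be sketched in C as shown here. The sketch tracks only the number of tasks per stage, which is what the load stages require. The names `next_level_count` and `generate_tree` and the constant `SPAWN_PERCENT` are hypothetical: the text does not state the distribution of Spawn, so a fixed probability is assumed, and `rand() % n` is used as a simple (slightly biased) stand-in for a uniform draw.

```c
#include <assert.h>
#include <stdlib.h>

/* Assumed parameter (not given in the text): chance in percent that Spawn is non-zero. */
#define SPAWN_PERCENT 70

/* Expands one level of the task tree. `nodes` is the number of nodes at stage
   `stage`; the return value is the number of nodes at the next stage. */
int next_level_count(int nodes, int stage, int last_stage, int max_fanout)
{
    assert(nodes >= 0 && max_fanout >= 1);
    if (stage >= last_stage)
        return 0;                       /* nodes at the pre-specified stage are leaves */
    int children = 0;
    for (int n = 0; n < nodes; ++n) {
        int spawn = (rand() % 100) < SPAWN_PERCENT;   /* random variable Spawn */
        if (!spawn)
            continue;                                 /* Spawn is zero: leaf node */
        children += 1 + rand() % max_fanout;          /* Fanout in {1..max_fanout} */
    }
    return children;
}

/* Builds the tree breadth-first from the root and records, in counts[0..last_stage-1],
   the number of tasks at stages 1..last_stage. Returns the total number of tasks. */
int generate_tree(unsigned seed, int last_stage, int max_fanout, int counts[])
{
    assert(last_stage >= 1);
    srand(seed);
    counts[0] = 1;                      /* the root task */
    int total = 1;
    for (int s = 1; s < last_stage; ++s) {
        counts[s] = next_level_count(counts[s - 1], s, last_stage, max_fanout);
        total += counts[s];
    }
    return total;
}
```

With `max_fanout` set to 2 or 3 the sketch corresponds to random binary or ternary trees; a regular (uniform) tree would instead give every internal node the same number of children.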
In a tree thus generated, each node represents a task. The
experiments are based upon various types of randomly generated tree
structures (non-uniform load). The two categories of such structures are:
i) random binary tree structures (tree structures having a maximum fanout
of two, with the fanout uniformly distributed over the range {1,2}), and
ii) random ternary tree structures (tree structures having a maximum fanout
of three, with the fanout uniformly distributed over the range {1,2,3}).
The task tree grows as tasks randomly spawn new tasks and the scheduler
schedules them onto neighboring processors (or the spawning processor itself)
as per the scheduling rules. A task, after spawning sub-tasks, enters a wait
state. A waiting task becomes executable at a later point of time, when all
of its sub-tasks have completed execution. An executable task, on being
selected by the processor, executes to produce a result packet. The result
packet is then forwarded to its destination.
Using the above pattern of task structures (load), the performance of the
network has been tested for the TRS scheme as well as for other dynamic
scheduling schemes. The performance is measured in terms of the load
imbalance left after a balancing action at each stage of the load. The above
simulation has been performed on LEC and other multiprocessor networks using
an IBM server X series 226 with an Intel Xeon 3.0 GHz processor.
5.3.2 TRS Scheme on LEC Network 
To study the behavior of the proposed Two Round Scheduling (TRS)
scheme on the LEC network, the load is generated stage by stage from the
task structures and a balancing action takes place at every stage. A
particular stage of the task structure represents a fixed number of tasks.
Tables 5.2 and 5.3 show sample computer-generated output tracing the
progress of load migration for uniform and non-uniform load, respectively,
on the LEC network of six processors (up to stage 7). For each stage the
entries are: the load on each processor (donors and acceptors) as generated
at that stage (TGS) and after the Two Round schedule (TRS), IL, rounded IL
(RIL), LIF (%) and the total tasks (TT) available at that stage of load.

Table 5.2: Load Migration Table for uniform load on LEC (TRS) up to stage 7
(Sample Output 1)
Stage  Row   P0  P1  P2  P3  P4  P5   IL         RIL        LIF (%)    TT
  1    TGS    1   0   0   0   0   0   0.166667   1.000000   0.000000    1
       TRS    1   0   0   0   0   0
  2    TGS    2   0   0   0   0   0   0.33333    1.000000   0.000000    2
       TRS    1   1   0   0   0   0
  3    TGS    2   2   0   0   0   0   0.66667    1.000000   0.000000    4
       TRS    1   1   1   0   1   0
  4    TGS    2   2   2   0   2   0   1.33333    2.000000   50.00025    8
       TRS    2   2   2   0   2   0
  5    TGS    4   4   4   0   4   0   2.66667    3.000000   12.49985   16
       TRS    3   3   3   2   3   2
  6    TGS    6   6   6   4   6   4   5.33333    6.000000   12.50000   32
       TRS    6   6   6   4   6   4
  7    TGS   12  12  12   8  12   8   10.66666   11.000000  3.12500    64
       TRS   11  11  11  10  11  10

Table 5.3: Load Migration Table for non-uniform load on LEC (TRS) up to stage 7
(Sample Output 2)
Stage  Row   P0  P1  P2  P3  P4  P5   IL         RIL        LIF (%)    TT
  1    TGS    1   0   0   0   0   0   0.166667   1.000000   0.000000    1
       TRS    1   0   0   0   0   0
  2    TGS    1   0   0   0   0   0   0.166667   1.000000   0.000000    1
       TRS    1   0   0   0   0   0
  3    TGS    1   0   0   0   0   0   0.166667   1.000000   0.000000    1
       TRS    1   0   0   0   0   0
  4    TGS    2   0   0   0   0   0   0.33333    1.000000   0.000000    2
       TRS    1   1   0   0   0   0
  5    TGS    2   0   0   0   0   0   0.33333    1.000000   0.000000    2
       TRS    1   1   0   0   0   0
  6    TGS    2   2   0   0   0   0   0.66667    1.000000   0.000000    4
       TRS    1   1   1   0   1   0
  7    TGS    2   2   0   0   0   0   0.66667    1.000000   0.000000    4
       TRS    1   1   1   0   1   0
The behavior of the load imbalance is evaluated for both of the above
mentioned types of load. The average value of LIF is obtained and the curves
are plotted as the average percent LIF against the load at various stages (i.e.
the problem size), as shown in Figures 5.2 and 5.3.
Figure 5.2: Performance of LEC network for uniform load (plot "LIF for Uniform Load (TRS)": average LIF (%) against load at various stages)
Figure 5.3: Performance of LEC network for non-uniform load (plot "LIF for Non Uniform Load (TRS)": average LIF (%) against load at various stages)
The trend of the curve in Figure 5.2 indicates the behavior of the
load imbalance factor with respect to the load at various stages for uniform
task structures. It is observed that LIF initially rises from zero to a high
value and then reduces asymptotically. When the number of tasks at a
particular stage is smaller than the number of nodes, the LIF shows a high
value and hence a high load imbalance. However, as the number of tasks
increases, the LIF starts reducing (as the balancing activity takes effect)
and finally approaches zero. For non-uniform load (Figure 5.3) the value of
LIF starts from zero, reaches its peak and remains the same for several
stages of task generation. This high value of LIF over several load stages
may be due to the unpredictable load, which yields fewer tasks than there
are processors in the network during these stages. In this situation some of
the processors may remain idle, and hence the processing elements are not
utilized efficiently. On the other hand, when a sufficient number of tasks
is available, the LIF starts reducing. This reduction, however, is not as
smooth as in the case of uniform load.
5.3.3 TRS Scheme on other networks 
To draw general conclusions about the effectiveness of the proposed TRS
scheme, it has been implemented on other reported multiprocessor networks
under the same environment. The simulation run consists of generating various
types of load and mapping them on the six processor Linearly Extensible Tree
(LET), and on the eight processor hypercube and de Bruijn networks. The
estimate of LIF is obtained for various stages of the task structures in the
same way as on the LEC network, and the curves are plotted as the average
percent LIF against the load for different stages, as shown in Figures 5.4
and 5.6.
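Since TRS takes the network only through its adjacency matrix, running it on another topology amounts to supplying a different matrix. The C sketch below builds matrices for eight processors using the standard textbook definitions: a hypercube links nodes whose labels differ in one bit, and an undirected binary de Bruijn graph links node i to (2i) mod N and (2i + 1) mod N, with self-loops dropped. The thesis does not list the matrices it used, so these definitions and all function names are assumptions of this sketch.

```c
#include <assert.h>

#define NMAX 8   /* eight processors, as in the hypercube and de Bruijn cases */

/* Hypercube: two nodes are linked when their labels differ in exactly one bit. */
void build_hypercube(int adj[NMAX][NMAX])
{
    for (int i = 0; i < NMAX; ++i)
        for (int j = 0; j < NMAX; ++j) {
            int diff = i ^ j;
            adj[i][j] = (diff != 0 && (diff & (diff - 1)) == 0);
        }
}

/* Undirected binary de Bruijn graph: node i is linked to (2i) mod N and
   (2i + 1) mod N; self-loops (at nodes 0 and N-1) are dropped. */
void build_de_bruijn(int adj[NMAX][NMAX])
{
    for (int i = 0; i < NMAX; ++i)
        for (int j = 0; j < NMAX; ++j)
            adj[i][j] = 0;
    for (int i = 0; i < NMAX; ++i)
        for (int b = 0; b < 2; ++b) {
            int j = (2 * i + b) % NMAX;
            if (j != i) {
                adj[i][j] = 1;
                adj[j][i] = 1;
            }
        }
}

/* Number of links incident on processor p. */
int degree(int adj[NMAX][NMAX], int p)
{
    assert(p >= 0 && p < NMAX);
    int d = 0;
    for (int j = 0; j < NMAX; ++j)
        d += adj[p][j];
    return d;
}
```

Under these definitions every hypercube node has three links, whereas de Bruijn nodes have between two and four, so the set of one-hop acceptors available to a donor differs from network to network.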
Figure 5.4: TRS scheme on various multiprocessor networks (plot "LIF for Uniform Load (TRS)": average LIF (%) against load at various stages; curves: LEC, LET, de Bruijn, Hypercube)
The trends of the curves in Figure 5.4 indicate the behavior of
the load imbalance factor with respect to the load at various stages on
various multiprocessor networks for a uniform task structure. It is clear
from the curves that the value of LIF initially starts from zero and reaches
its maximum value for all the multiprocessor networks. This high value of LIF
is due to the small number of tasks generated and balanced on the networks.
When a sufficient number of tasks is available, a smaller value of LIF is
obtained. Therefore, good balancing is achieved when more tasks are
available.
To study the effect of the proposed Two Round Scheduling on LEC, the
simulation results for LEC are compared with those for other networks, namely
the LET, Hypercube and de Bruijn networks. The comparative study indicates
similar behavior on all the multiprocessor networks for uniform load. The
value of LIF starts reducing when a sufficient number of tasks is available
and finally approaches zero in all the multiprocessor networks. However, the
value of LIF is always smaller on the LEC network than on the other networks.
In particular, the behavior of LIF is similar for the LET and LEC networks.
This indicates that the TRS algorithm performs equally well on both the
LET and LEC networks for uniform load. The better performance of the
LET and LEC networks for uniform load is attributed to the smaller number of
processors in these networks.
To check further whether TRS performs better on LEC or on LET, the
load balancing time (i.e. the time taken to reach a zero value of LIF) of the
TRS scheme is evaluated on both networks. The balancing time at various load
stages for a uniform task structure is evaluated as another performance
parameter for all the multiprocessor networks mentioned above, and the
curves are plotted as time versus load at different stages, as shown in
Figure 5.5.
Figure 5.5: Performance of LEC and other multiprocessor networks (plot "Balancing Time for Uniform Load": balancing time against load at various stages; curves: LEC, LET, Hypercube, de Bruijn)
The performance results shown in Figure 5.5 indicate that in all the
multiprocessor networks, the balancing time increases as the number of tasks
increases. However, the balancing time in the LEC network is always smaller
than in the other networks. Therefore, although the LEC and the LET networks
produce similar results in terms of LIF when TRS is implemented, LET gives
an inferior performance when balancing time is taken into consideration. Thus,
it may be concluded that the LEC network outperforms the Hypercube and
de Bruijn networks for uniform load and gives performance comparable to that
of the LET network.
Figure 5.6: TRS scheme on various multiprocessor networks (plot "LIF for Non Uniform Load": average LIF (%) against load at various stages; curves: LEC, LET, de Bruijn, Hypercube)
In the case of non-uniform load (Figure 5.6), the effect of the variation
of load is clearly indicated in the curves. The LIF starts from zero and
reaches a high value, which stays high for several stages of task generation
on all the multiprocessor networks. This high value of LIF is due to the
small number of tasks running on the networks during these stages. The small
number of tasks cannot be balanced among all the processors of the networks,
which indicates inefficient balancing in the networks. However, in all cases
the value of LIF starts decreasing towards its minimum once the number of
tasks is sufficient. In general, the value of LIF is smaller for the LEC
network, except at the earlier stages of load generation when sufficient
tasks are not available. Therefore, it may be concluded that the LEC network
performs better with the TRS algorithm for non-uniform load also.
The performance of LEC is also evaluated in terms of load balancing
time for non-uniform load. The TRS scheme is implemented on various
multiprocessor networks and the balancing time on each network is
evaluated for various load stages. For comparison purposes the curves are
plotted as time versus load (non-uniform), as shown in Figure 5.7.
Figure 5.7: Performance of LEC and other multiprocessor networks (plot "Balancing Time for Non Uniform Load": balancing time against load at various stages; curves: LEC, LET, Hypercube, de Bruijn)
The performance results shown in Figure 5.7 indicate the behavior of the
balancing time when the TRS scheme is implemented on various multiprocessor
networks for non-uniform load. It is observed from the curves that there is no
regular pattern in the balancing time with load. In the case of non-uniform
load the behavior of the tasks is unpredictable; therefore, the balancing
time varies on each of the multiprocessor networks. In general, the
Hypercube and de Bruijn networks show similar behavior in time. On the other
hand, LEC and LET show a similar change in balancing time over the various
load stages. This is due to the fact that both the LEC and LET networks have
a smaller number of processors. It is possible that fewer tasks are
available and that these tasks are consequently balanced efficiently on the
smaller number of processors. However, LEC shows the smaller balancing time.
5.4 Performance Study of TRS and other Dynamic Scheduling 
Schemes on LEC network 
The basic aim of the current study is to develop a good organizational
model (i.e. a suitable topological layout with an appropriate scheduling
strategy). Therefore, to validate the performance of the proposed LEC network
with the proposed TRS scheme, it is desirable to implement other standard
dynamic scheduling schemes on LEC. In the present work, in addition to the
proposed TRS scheme, some reported dynamic scheduling schemes have also been
implemented on the LEC network under the same environment [LeMair and
Reeves, 1993]. In particular, the following three scheduling schemes were
considered and have been implemented on the LEC network after appropriate
modification. These schemes have given optimal performance on the particular
multiprocessor networks for which they were designed.
• Minimum Distance Scheduling 
• Hierarchical Balancing Method 
• Gradient Model 
The Minimum Distance Scheduling (MDS) scheme [Rafiq et al., 1999]
initiates the load balancing process based on the connectivity of the
network. It considers only the directly connected processors in each step of
load balancing, and is applied on LEC in this form.
The HBM scheme [LeMair and Reeves, 1993] has already been
discussed in detail in Chapter 3. To implement the HBM scheme on the
proposed LEC network, it is modified in such a way that the complete network
is divided into three segments indicated by level numbers, namely Level 0,
Level 1 and Level 2. These levels are arranged in terms of their hierarchy.
Balancing starts from the lowest level and ascends to the highest level, and
therefore covers the whole system. The hierarchical structure of LEC is shown
in Figure 5.8.
In the Gradient Model (GM) scheme [LeMair and Reeves, 1993], load
transfer is restricted to the direction of the most lightly loaded processors.
That is, an overloaded processor sends its excess load to only one neighbor
processor at each step of the load balancing process.
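The single-neighbor transfer rule stated above can be illustrated with the following C sketch. It implements only that one sentence and is not the full Gradient Model of the cited work, which additionally maintains a gradient (proximity) map. The function name, the use of a rounded ideal load `ril` as the threshold and the cap that prevents the receiver from exceeding `ril` are assumptions of this sketch.

```c
#include <assert.h>

#define PMAX 6  /* processors in the six-node LEC network */

/* Adjacency matrix of the LEC network as given in Figure 5.1 (b). */
static const int adj[PMAX][PMAX] = {
    {0, 1, 1, 0, 1, 1},
    {1, 0, 1, 1, 0, 1},
    {1, 1, 0, 1, 1, 0},
    {0, 1, 1, 0, 1, 1},
    {1, 0, 1, 1, 0, 1},
    {1, 1, 0, 1, 1, 0},
};

/* One balancing step: every processor whose load exceeds ril sends excess
   load to a single neighbor, the most lightly loaded one. Returns the
   number of tasks moved in this step. */
int single_neighbor_step(int load[PMAX], int ril)
{
    assert(ril >= 0);
    int moved = 0;
    for (int p = 0; p < PMAX; ++p) {
        if (load[p] <= ril)
            continue;
        int target = -1;
        for (int j = 0; j < PMAX; ++j)
            if (adj[p][j] && (target < 0 || load[j] < load[target]))
                target = j;
        if (target < 0)
            continue;
        int excess = load[p] - ril;
        int room = ril - load[target];
        int n = excess < room ? excess : room;   /* assumed cap: do not overload the receiver */
        if (n > 0) {
            load[p] -= n;
            load[target] += n;
            moved += n;
        }
    }
    return moved;
}
```

On the load {4, 2, 2, 0, 2, 2} with `ril` equal to 2, this step moves nothing, because all direct neighbors of P0 are already at the threshold and the idle P3 is two hops away.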
Figure 5.8: The hierarchical structure of LEC network (labels in the figure: Level 0 (Highest Level), Level 2 (Middle Level), Level 1 (Lowest Level))
5.4.1 Comparison of TRS scheme with other scheduling schemes 
To study the behavior of the different scheduling schemes on the LEC
network, the simulation study has been carried out in the same environment
and the LIFs are computed for both uniform and non-uniform types of load
structures. The average percentage of LIF is obtained and the curves are
plotted against the various load stages, as shown in Figures 5.9 and 5.10.
Figure 5.9: Comparison of TRS with other scheduling schemes on LEC (plot "Various Dynamic Scheduling Schemes on LEC (Uniform Load)": average LIF (%) against load at various stages; curves: TRS, MDS, HBM, GM)
The trends of the curves in Figure 5.9 indicate the behavior of the
various scheduling schemes with uniform load when implemented on the LEC
network. The TRS scheme shows the smallest imbalance, with a negligible
average value of LIF at the higher load stages. The value of LIF becomes zero
when the network receives a sufficient amount of load, which is then
immediately balanced. Therefore, it can be said that the TRS scheme achieves
a high degree of balancing, which makes the LEC network more effectively
utilized than with the other scheduling schemes implemented on LEC. The other
scheduling schemes are unable to achieve a high degree of balancing for the
uniform type of load structure.
Figure 5.10: Comparison of TRS with other scheduling schemes on LEC (plot "Various Scheduling Schemes on LEC (Non Uniform Load)": average LIF (%) against load at various stages; curves: TRS, MDS, HBM, GM)
The curves in Figure 5.10 show the load balancing on LEC for
non-uniform load when implemented using the different scheduling schemes. It
is clear from the curves that at every stage of the load, the average value
of LIF obtained with the TRS scheme is superior, i.e. it is smaller at every
point, and also smaller at the peaks, than with the other scheduling
schemes. The TRS scheme reduces the imbalance linearly and approaches the
minimum as the load increases. Again, this shows that the degree of balancing
of non-uniform load on LEC is high, i.e. the proposed network is effectively
utilized [Samad et al., 2010].
From the comparison of the graphs based on the various simulation
results, it may be concluded that the TRS scheme performs well on the LEC
network in terms of both LIF and balancing time. The proposed TRS scheduling
scheme performs better, its degree of balancing is higher and the network
utilization is efficient. Therefore, it can be concluded that the proposed
LEC network together with the proposed TRS scheme forms the ideal
organizational model. The organization is found to perform better
particularly for unpredictable load.
In the next chapter, the six node LEC multiprocessor is proposed for
use as an information retrieval server. A new algorithm for information
retrieval is proposed and implemented on LEC to test its performance for
retrieval of information.
CHAPTER 6 
LEC as Information Retrieval Server 
The Internet has become one of the key global infrastructures. Many
documents and files are being downloaded from the Internet. As a result,
popular Web and ftp sites have to deal with heavy loads. On the other hand,
as the access traffic (load) over the Internet increases, users' expectations
are also high. Generally, it is required that the desired information be
downloaded in the shortest possible time. In the present scenario, the speed
at which requests are handled by the server is becoming critically important
compared to other factors such as available bandwidth. Therefore,
providing fast and effective information retrieval is a challenging task in
current research. The main issue is how to cope with the demand for such
fast and effective download in order to achieve the desired quality of
performance.
There are different ways in which present web sites handle huge
amounts of Internet traffic. Broadly, they may be classified into the
following three categories:
• Parallel Downloading 
• Caching method 
• Server performance enhancement 
In the Parallel Downloading (PD) scheme, popular documents are often
maintained on mirror servers, and a client requesting a file opens concurrent
connections to multiple servers. Several experiments showing that PD results
in higher aggregate downloading throughput, and therefore shorter downloading
time experienced by the clients, have been reported [Philopoulos and
Maheswaran, 2001], [Gkantsidis et al., 2003], [Funasaka et al., 2005]. The
shorter downloading time is achieved at the expense of more complexity and
overhead incurred in coordinating the servers and maintaining a large number
of concurrent requests [Rodriguez and Biersack, 2002], [Koo et al., 2003],
[Leung and Li, 2006].
In the caching method, information is downloaded from a cache instead of
being retrieved from the original server. Jin et al. proposed a
partial caching method for reducing the response time. It combines download
from a cache with that from the original server to accelerate retrieval. It can
reduce traffic and improve overall streaming quality [Shi et al., 2002], [Jin et
al., 2002], [Shen and Xu, 2009].
The present work puts emphasis on the last option. Designing high
performance machines/servers is an important and effective way of enhancing
the performance of Internet based services. Several software and hardware
scale-up techniques have been proposed to enhance the performance of a single
node server. In the software approach, a high throughput can be obtained by
employing a larger data cache and effective cache replacement techniques
[Cardellini et al., 2002], [Shen and Xu, 2009]. Nowadays, with hardware cost
continuously decreasing, additional computing power can be obtained by
adding more processors and memory to a single system. One approach to
designing such machines is to incorporate the concepts of multiprocessor
architectures [Foglia et al., 2000]. These servers can utilize the
multiprocessing capability of n processors under a single domain and
hence provide a cost-effective solution. Commercial services like Google and
eBay have used this technique effectively [Barroso et al., 2003]. Ranjan and
Knightly proposed an architecture that includes a grid of clusters connected via
high bandwidth links. The performance is optimized by considering both the
server loads and network latencies. An algorithm is proposed to reduce client
access time as well as to minimize resource utilization when implemented
on the proposed architecture [Ranjan and Knightly, 2008].
In this chapter, the scope of using the proposed LEC architecture for 
information retrieval is discussed. A new algorithm for retrieving information 
has been proposed and implemented on the proposed network. A comparative 
study has been carried out to evaluate the performance of LEC, when used as a 
server. Through simulation experiments, it is shown that the resource download 
time is reduced considerably. 
6.1 Server Architectures 
Different servers take different approaches towards enhancing their
performance. Server architectures are normally classified into one of the
following categories [Choi et al., 2005]:
(1) Multi Process
(2) Multi Threaded
(3) Single Process Event Driven
(4) Asymmetric Multi Process Event Driven
6.1.1 Multi Process Server 
In the multi-process (MP) server architecture, a process is assigned to
execute sequentially the basic steps associated with serving a client request.
The process performs all the steps related to serving a request before it
accepts a new request. Since multiple processes are employed, many requests
can be served concurrently. Each process has its own private address space;
therefore, the main drawback of this model is the difficulty of sharing any
global information such as cache information. An MP based server needs more
memory to maintain the same cache size per process compared to other server
models. Therefore, the performance of this type of server is inferior to
that of the other models.
6.1.2 Multi Threaded Server 
The Multi-Threaded (MT) model consists of multiple kernel threads
with a single shared address space. Threads are scheduled on a processor, and
each thread executes all the steps required to serve a request independently of
other processes and threads. The main advantage of this model is that threads
can share many of the process's resources such as address space, open files
and, in particular, the data cache. The drawbacks of this type of server are
that kernel threads are not supported by all operating systems, and that
sharing data cache information among many threads may lead to high
synchronization overhead. The widely used Apache server was originally
designed on the MP model. Later, it was modified by incorporating the concepts
of the MT model [Apache, 2003].
6.1.3 Single Process Event Driven Server 
The Single Process Event Driven (SPED) model performs well for
workloads in which most of the requested content is in main memory. SPED
servers provide better performance for cached workloads and avoid
synchronization overheads among threads or processes by overlapping CPU, disk
and network operations. However, this overlap does not work for disk related
operations, due to the limitations of some current operating systems [Pai et
al., 1999]. A server based on this model is implemented by Zeus Technology
[Zeus, 2003].
6.1.4 Asymmetric Multi Process Event Driven Server 
The Asymmetric Multi Process Event Driven (AMPED) model is an
improved version of SPED that removes the weakness of the SPED model.
The servers based on this model combine the event driven approach of the
SPED architecture with multiple helper processes. The helper processes
are responsible for handling all disk oriented requests. The main server
process serves only the cache-hit requests. If there is a cache miss, the main
process forwards the request to a helper process; the helper process then
fetches the data from the disk and sends it back to the main process using
Inter-Process Communication (IPC).
6.2 The proposed LEC Server 
All the server models discussed in the above section were basically
proposed for single processor systems. However, with the advent of System on
Chip (SoC) architectures, high performance server models can be designed.
With technology scaling down rapidly, it will be possible to fabricate
SoCs with a larger number of processors [Benini and Micheli, 2002], [Bell and
Gray, 2003], [Edenfeld et al., 2004], [Tse, 2009]. It is observed that the cost
of installing and maintaining a k-processor system is significantly less than
that of k separate single processor systems. The proposed LEC server is
motivated by this context, and attempts to see how server design can
benefit from architectural innovations. The effort here is to investigate how
to exploit a multiprocessor architecture with a small number of nodes (six
nodes in LEC, shown in Figure 6.1) to build a high performance Information
Retrieval (IR) server.
In order to utilize all the nodes available in the network efficiently (i.e.
with minimum imbalance of load on the server), requests are migrated from
overloaded nodes to underloaded nodes. Requests are transferred in blocks
to minimize the time involved in the migration. The proposed LEC server is
tested by migrating requests in different block sizes, and the migration time
is evaluated in each case under different loads. For this purpose, the
proposed TRS scheme is modified so that different values of LIF are set and
the time taken to balance the load to the desired value of LIF is evaluated.
The whole algorithm is implemented in the C language. Pseudo code of the
modified TRS algorithm is given in Tables 6.1 and 6.2.
TABLE 6.1: Procedure for Load Balancing (Fixed value of LIF) 
modified_trs()
/* Let number of donors = n_donor = 0 and number of acceptors = n_accr = 0 */
/* Let the value of load = Tload and number of processors = PMAX */
alif, n_donor = 0, n_accr = 0, donor[ ], accr[ ], p[ ];
/* Calculate IL */
IL = ceil (Tload / PMAX);
/* Map the load among all the processors in the network */
s = 0;
for (i = 0; i < PMAX-1; i++)
{
p[i].load = random (Tload);
s = s + p[i].load;
}
p[i].load = Tload - s;
/* Identify donors and acceptors */
for (i = 0; i < PMAX; i++)
{
if (p[i].load > IL)
{
donor[n_donor] = i;
n_donor++;
}
}
for (i = 0; i < PMAX; i++)
{
if (p[i].load < IL)
{
accr[n_accr] = i;
n_accr++;
}
}
/* Calculate the actual value of LIF */
/* Let maxload be the maximum load available on one of the donors */
maxload = max (donor[ ], n_donor);
lifac = (maxload - IL) / IL;
plifac = lifac * 100;
/* Perform migrations among connected overloaded and underloaded processors
while the actual LIF (%) is greater than the fixed value of LIF (input value) */
Let level of connectivity = 1;
while (plifac > LIF)
{
for (i = 0; i < n_donor; i++)
{
while (donor[i] is overloaded)
{
for (j = 0; j < n_accr; j++)
{
if (donor[i] is connected to accr[j])
{
/* migrate load in different block sizes */
p[donor[i]].load = p[donor[i]].load - block_size;
p[accr[j]].load = p[accr[j]].load + block_size;
}
}
donor[i] is exhausted or balanced;
}
}
recalculate maxload and plifac;
if the value of plifac is still greater than the fixed value of LIF, then set the level of
connectivity to 2 and repeat the above procedure;
}
/* Functions used by the algorithm */
max (donor[ ], n_donor)
{
max = 0;
for (i = 0; i < n_donor; i++)
if (max < p[donor[i]].load)
max = p[donor[i]].load;
return (max);
}
end of procedure
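The balancing loop of Table 6.1 can be sketched as a small C routine. This is only an illustrative sketch under stated assumptions, not the thesis implementation: the link structure of the six-node LEC is not given in this section, so a plain ring stands in for it (the hypothetical helper connected()), and a block is moved only when the donor stays at or above IL and the acceptor stays at or below IL. The names ideal_load, lif_percent and balance are likewise hypothetical.

```c
#define PMAX 6   /* six-node server, as in the text */

/* Ideal load per processor: IL = ceil(Tload / PMAX), in integer arithmetic. */
int ideal_load(int tload) {
    return (tload + PMAX - 1) / PMAX;
}

/* Load imbalance factor in percent: 100 * (maxload - IL) / IL. */
double lif_percent(const int load[], int il) {
    int maxload = load[0];
    for (int i = 1; i < PMAX; i++)
        if (load[i] > maxload)
            maxload = load[i];
    return 100.0 * (maxload - il) / il;
}

/* ASSUMPTION: a ring replaces the real LEC links. Two nodes count as
   connected at a given level if their ring distance is at most `level`. */
int connected(int a, int b, int level) {
    int d = a > b ? a - b : b - a;
    if (d == 0)
        return 0;
    int ring = d < PMAX - d ? d : PMAX - d;
    return ring <= level;
}

/* Move blocks from overloaded to underloaded connected nodes until the
   LIF target is met or no further move is possible.
   Returns the number of block transfers performed. */
int balance(int load[], int il, int block, double target_lif, int level) {
    int moves = 0;
    int progress = 1;
    while (progress && lif_percent(load, il) > target_lif) {
        progress = 0;
        for (int i = 0; i < PMAX; i++) {
            for (int j = 0; j < PMAX; j++) {
                if (!connected(i, j, level))
                    continue;
                /* ASSUMPTION: donor stays >= IL, acceptor stays <= IL. */
                if (load[i] - block >= il && load[j] + block <= il) {
                    load[i] -= block;
                    load[j] += block;
                    moves++;
                    progress = 1;
                }
            }
        }
    }
    return moves;
}
```

With the loads {300, 100, 150, 150, 100, 200}, a block size of 10 and a target LIF of 40%, this sketch stalls at an LIF of about 43.7% after 9 transfers at connectivity level 1, and reaches about 31.7% after 2 further transfers at level 2, which mirrors the "set the level of connectivity to 2 and repeat" step of the table.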
TABLE 6.2: Procedure to calculate the balancing time for a fixed value of LIF 
start = clock ( );
call the procedure to balance the processor load (modified_trs) for a given value of
LIF
end = clock ( );
time = (end - start) * 1000 / CLK_TCK;
end of procedure
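The timing procedure of Table 6.2 can be written in standard C roughly as follows. This is a sketch under assumptions: CLOCKS_PER_SEC is the standard macro that replaces the older CLK_TCK name, the helper names elapsed_ms and spin are hypothetical, and spin is only a stand-in workload, not the balancing procedure itself. Note that clock() reports processor time used by the program rather than wall-clock time.

```c
#include <time.h>

/* Run work(arg) once and return the processor time it used, in ms. */
double elapsed_ms(void (*work)(void *), void *arg) {
    clock_t start = clock();
    work(arg);
    clock_t end = clock();
    return 1000.0 * (double)(end - start) / CLOCKS_PER_SEC;
}

/* Hypothetical stand-in workload: a counting loop of *arg iterations. */
void spin(void *arg) {
    long n = *(const long *)arg;
    volatile long sink = 0;   /* volatile keeps the loop from being optimized away */
    for (long i = 0; i < n; i++)
        sink += i;
}
```

Dividing in floating point avoids the truncation that an integer expression such as (end - start) * 1000 / CLK_TCK can introduce for short runs.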
In the above algorithm, two parameters are taken into consideration: the
first is a fixed value of LIF and the second is the block size. Block size is
the number of requests (tasks) migrated in one balancing action. For a
particular value of LIF, the timings are computed for different sizes of
load on the LEC server. The results are computed for loads of 1000, 5000 and
10000 tasks (requests). The modified TRS scheme has been implemented on an
IBM xSeries 226 server having an Intel Xeon 3.0 GHz processor, in the same
environment. The simulation results show similar behavior for all of the
above mentioned loads; however, the time to attain a particular value of LIF
differs when tasks are migrated from donor to acceptor with different block
sizes. Figures 6.2 and 6.3 show this behavior for a load of 1000 tasks.
[Plot: LIF vs. Time (for various block sizes). x-axis: LIF (%), 0 to 60; y-axis ticks 0 to 3500; series: Block size = 3, Block size = 5]
Figure 6.2: Performance of LEC IR server 
[Plot: LIF vs. Time (for various block sizes). x-axis: LIF (%), 0 to 60; y-axis ticks 0 to 1800; series: Block size = 8, Block size = 16]
Figure 6.3: Performance of LEC IR server
It is observed from the curves shown in Figures 6.2 and 6.3 that the
time to balance the load is smaller when requests are migrated in larger block
sizes. The small balancing time in the LEC server is due to the fact that LEC
possesses high connectivity with a mesh topology. Although larger block sizes
give a smaller balancing time, the difference in balancing time between block
sizes is not significant when larger values of LIF are set (beyond the
tolerance limit of 40% LIF). In general, it may be concluded that increasing
the block size of the requests reduces the migration time
[Samad et al., 2009]. This effect is depicted in Figure 6.4. Therefore,
when large requests are made to the server, it automatically selects a larger
block size for migration of query requests among different nodes to optimize
the performance of the server.
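One way to see why larger blocks shorten the balancing time is to count the transfers needed to move a given excess load. The sketch below assumes, purely for illustration, that each block transfer carries a roughly fixed overhead, so that fewer transfers mean less migration time; this cost model and the name transfers_needed are assumptions, not the measured behavior reported in the figures.

```c
/* Number of block transfers needed to move `excess` tasks off a donor
   when `block` tasks are migrated per transfer (block must be > 0). */
int transfers_needed(int excess, int block) {
    return (excess + block - 1) / block;   /* integer ceiling */
}
```

For an excess of 133 tasks, the block sizes used in Figures 6.2 and 6.3 (3, 5, 8 and 16) give 45, 27, 17 and 9 transfers respectively under this assumption.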
[Plot: Block Size vs. Time. x-axis: Block size; y-axis: Time (ms)]
Figure 6.4: Performance of LEC for various block sizes 
6.3 Information Loading and Retrieval 
The Web holds a huge amount of content that is useful to millions of users,
which makes it very difficult to analyse and classify all of the information
available on it. Search engines have solved some problems of information
retrieval from the Internet. However, additional problems appear in search
engines, such as information overload, which reduces efficiency [Cardellini et
al., 2001], [Xi-Dao et al., 2000], [He et al., 2007]. If the amount of
information retrieved is large, it delays the download. The proposed server
classifies the information collected in its database in such a way that
whenever any information is retrieved from the database, it is retrieved
almost instantly. If the required information is not available, a link to it
is searched for on other servers. The whole retrieval system consists of three
components, i.e. an analyser, an indexer and a user interface, along with the
database, as shown in Figure 6.5.
Figure 6.5: Information loading and retrieval system 
6.3.1 Loading of Information on LEC Server
The major steps performed by the proposed server are as follows. The server
collects information in the form of web pages or files from the Internet. Each
file, when loaded into the database of the server, is first analysed in terms
of its size. An ID is assigned and the file is divided into several parts (or
packets) of equal size. Each packet is in turn designated by its own ID and
loaded into the database. The addresses of the packets in the database, along
with their IDs, are recorded in a table, which is shared by all the nodes of
the server and updated periodically. This table may be viewed as the indexer
which indexes the data, as shown in Figure 6.5. The ID of each packet is
recorded in the table in such a way that the vector address of the table
corresponds to the packet ID. The algorithm designed for the above procedure
is shown in Table 6.3 and implemented on LEC, so that the round trip time
between nodes is reduced.
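The loading steps described above (split a file into fixed-size packets, store them, and record each packet's address in a table whose slot number is the packet ID) can be sketched in C as follows. This is a minimal single-node sketch under assumptions: the names index_entry, load_file and find_packet, the 8-byte packet size and the flat in-memory database buffer are all hypothetical, and the last packet of a file is allowed to be shorter than the others.

```c
#include <stddef.h>
#include <string.h>

#define PACKET_SIZE 8    /* assumed packet size in bytes */
#define MAX_PACKETS 64   /* assumed capacity of the index table */

struct index_entry {
    int    file_id;   /* ID of the file the packet belongs to */
    size_t offset;    /* address of the packet inside the database buffer */
    size_t length;    /* bytes stored (the last packet may be short) */
};

/* Index table: the slot number is the packet ID. */
struct index_entry table[MAX_PACKETS];
size_t n_packets = 0;

/* Split `data` into packets, append them to `db`, and index each one.
   Returns the number of packets added, or 0 if there is no room. */
size_t load_file(int file_id, const unsigned char *data, size_t len,
                 unsigned char *db, size_t db_cap, size_t *db_used) {
    size_t needed = (len + PACKET_SIZE - 1) / PACKET_SIZE;
    if (n_packets + needed > MAX_PACKETS || *db_used + len > db_cap)
        return 0;
    for (size_t k = 0; k < needed; k++) {
        size_t off = k * PACKET_SIZE;
        size_t chunk = len - off < PACKET_SIZE ? len - off : PACKET_SIZE;
        memcpy(db + *db_used, data + off, chunk);
        table[n_packets].file_id = file_id;
        table[n_packets].offset = *db_used;
        table[n_packets].length = chunk;
        *db_used += chunk;
        n_packets++;
    }
    return needed;
}

/* Direct lookup by packet ID; returns NULL for an unknown ID. */
const struct index_entry *find_packet(size_t packet_id) {
    return packet_id < n_packets ? &table[packet_id] : NULL;
}
```

Because the packet ID doubles as the table index, a lookup needs no search, which is one plausible reading of "the vector address of the table corresponds to the packet ID".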
6.3.2 Retrieval of Information from LEC Server
A specialized search engine is useful when detailed or highly specific
information about a subject area is needed. It can match personal
requirements better than a general search engine [Bei et al., 2002]. Keeping
this in view, an information retrieval system has been designed. To download
information rapidly and accurately, a user interface is designed that works in
association with the loading mechanism. There are numerous criteria for
evaluating an information retrieval system, such as search capability,
precision, presentation of output and user effort. In the present work,
however, the emphasis is on how fast the information is retrieved after a
query is accepted by the proposed server. When a particular item of
information is to be retrieved, it is first searched for in the table, where
its details are available. The table is available to all the nodes of the
server. Using the details available in the table, the complete information is
then retrieved by different nodes concurrently from the existing database.
Users submit a query and receive result pages almost instantaneously.
Moreover, the model also searches Web information resources on its own
initiative and tracks changes in Web information, in order to automatically
update and extend the local resource periodically. For search requests that
cannot be answered from the database, the server can automatically select
commercial search engines to search for the information, and then classify
and integrate the information received from the different search engines. The
proposed algorithm is tested on LEC with a query set that consists of around
1500 queries. Pseudo code of the whole algorithm is shown in Tables 6.3 and
6.4.
Table 6.3: Procedure for searching an item (procedure: 1) 
search ( )
# define packetsize 8
/* Define packet structure */
class packet
{
char id [8];   /* class letter + up to 5 digits + terminating null */
int data [packetsize];
};
/* Define packet table */
packet packet_table [ ] [ ];
/* Generate packet data */
packet p;
p.id [0] = random (n) + 'A';
p.id [1] = '\0';
temp = random (10000) + 1;
itoa (temp, str);
strcat (p.id, str);
for (i = 0; i < packetsize; i++)
p.data [i] = random (1000) + 1;
/* Repeat above procedure for all packets */
packet packet_search ([number of class] [n])
{
packet packet_table [ ];
for (i = 0; i < n; i++)
{
packet_table = info (packet);
info = (class, packet_id, packet_data);
packet_data = {size, random ()};
}
}
/* Search information in a particular class */
search_all (class, info)
{
Call search_packets (des_class)
{
if (output_packet == des_class)
{
for (i = 0; i < size; i++)
{
Call search_info_packet (info)
if (info_flag == 1)
return (info);
else
return (0);
}
}
}
}
End of procedure
Table 6.4: Procedure to calculate average searching time (procedure: 2) 
/* Calculate average searching time to process the queries on six nodes LEC
server */
Generate no_of_threads = no_of_processors = N;
for (i = 0; i < N; i++)
{
Map the available load Qi at each Pi, i.e. 0 <= i < N;
Call function for mapping the load;
}
/* Start all threads, each representing a processor Pi with load Qi */
/* Start process for each i */
for (i = 0; i < N; i++)
{
start = clock ( );
Call procedure 1 for each Pi /* make search for each Pi */
end = clock ( );
t[i] = (end - start);
}
t1 = max (t, N);
t2 = min (t, N);
t_avg = (t1 + t2) / 2;
/* Functions used by the algorithm */
/* Map the load among all the processors in the network */
{
s = 0;
for (i = 0; i < N-1; i++)
{
p[i].load = random (Tload);
s = s + p[i].load;
}
p[i].load = Tload - s;
}
/* Find the maximum time consumed to process a request */
float max (float temp[ ], int n)
{
float maxval = temp [0];
for (i = 1; i < n; i++)
{
if (maxval < temp [i])
maxval = temp [i];
}
return (maxval);
}
/* Find the minimum time consumed to process a request */
float min (float temp[ ], int n)
{
float minval = temp [0];
for (i = 1; i < n; i++)
{
if (minval > temp [i])
minval = temp [i];
}
return (minval);
}
end of procedure
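Table 6.4 summarizes the per-node search times as the midpoint of the largest and smallest values, t_avg = (t1 + t2)/2. The sketch below computes that quantity in C and, for contrast only, the arithmetic mean over all nodes; the mean is an added illustration and not a metric used in the thesis, and the function names are hypothetical.

```c
/* Midpoint of the largest and smallest per-node times: (max + min) / 2.
   Expects n >= 1. */
double midrange_time(const double t[], int n) {
    double lo = t[0], hi = t[0];
    for (int i = 1; i < n; i++) {
        if (t[i] < lo) lo = t[i];
        if (t[i] > hi) hi = t[i];
    }
    return (hi + lo) / 2.0;
}

/* Arithmetic mean of the per-node times, shown for comparison only. */
double mean_time(const double t[], int n) {
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += t[i];
    return sum / n;
}
```

For six hypothetical node times of 10, 12, 14, 40, 11 and 13 ms, the midpoint is 25 ms while the mean is about 16.7 ms, so a single slow node pulls the midpoint up more strongly than the mean.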
6.4 Simulation Results 
In order to understand the effectiveness of the LEC server for
information exchange, a comparative simulation study has been carried out to
evaluate the performance of the LEC server against an MP type single
processor system. The proposed algorithm is implemented in the same
environment on both architectures. A number of search queries have been
examined, and the average times taken to process these queries are shown in
Figure 6.6.
[Plot: Searching Time vs. Queries. x-axis: Number of Queries, 0 to 1500; y-axis ticks 0 to 160; series: LEC server, Uni-processor server]
Figure 6.6: Comparison between LEC and MP type uni-processor servers 
The curves show that the search time for the MP system is always greater
than that for the multiprocessor system (the proposed LEC server). The query
searching time increases almost linearly with the number of queries in the MP
system, whereas in the proposed LEC server the searching time does not
increase rapidly as the number of queries increases. For a small number of
queries, the query search time on the MP type server is approximately 6 times
or more that on the six-node LEC server. For a large number of queries the
relative searching time of the LEC server decreases further: it takes less
than 1/6th of the time taken by the MP type server for the same number of
queries [Samad et al., 2008].
The proposed LEC server supports a wide range of IR commands such
as query, documents etc. Any node of the server can receive a client request
and become the initial service node for that request. The request is served by
the initial node based on the contents available at the node. Every node
maintains the information related to the request if it is available in the
local database. If the request is large or the initial node is overloaded,
then the request is forwarded to other nodes of the system. For load balancing
purposes, every node shares the information related to the request;
therefore, each node works as a service node.
It may be concluded from the above discussion that the proposed
algorithm, when implemented on the LEC server, gives better results for a
large number of queries. The proposed architecture with the proposed scheme
performs better especially for unpredictable and bursty queries, and hence
the proposed LEC could be used as an information retrieval server in
networking.
CHAPTER 7 
Conclusion and Future Work 
The number of Internet users has increased rapidly over the last few
years. This large user base is creating significant stress on the computing
resources of popular services available on the Internet. With the increasing
demand for information transactions, the existing approaches show their
limitations. The critical nature of many online transactions requires a high
speed server. Since such servers are anticipated to be one of the bottlenecks
in handling Internet based services, improving their performance has
become a critical issue in coping with the increasing use of Internet based
services.
Exploiting parallelism is now a necessity to improve the design of high
performance computer systems. In terms of hardware, this typically means
providing multiple simultaneously active processors. In terms of software, it
means structuring a program as a set of largely independent subtasks.
Efficient management of parallelism involves optimizing conflicting
performance indices, like the minimization of communication and scheduling
overheads, and even load distribution among the processors. Such issues are
addressed at the organizational level by designing a suitable topological
layout of the network and an appropriate scheduling mechanism.
The work carried out in this thesis can be divided into three parts. The
first part deals with the design of a low cost multiprocessor architecture,
namely the Linearly Extensible Cube (LEC) network, and with the study and
comparison of its characteristics. The second part deals with the design of a
dynamic scheduling strategy and its testing on the proposed network as well as
on other multiprocessor networks. Other scheduling schemes are also
implemented on LEC for various types of loads in the same environment. In the
comparative simulation studies of the proposed scheduling scheme against the
other dynamic scheduling schemes on the LEC network, the proposed organization
(proposed architecture and proposed scheduling scheme) was found to perform
best. Lastly, the LEC network has been used as a server, and its performance
evaluated, for the application of information retrieval from the Internet.
7.1 Conclusions 
The overall performance of a multiprocessor system is affected by a
number of factors, such as communication delays, imbalance of load among the
processors and scheduling overheads (problem partitioning, task allocation and
balancing). A close correspondence between the structure of the problem and
the processor interconnection is desired in order to minimize these overheads.
Scheduling plays a vital role in improving the performance, and is guided by
constraints that may differ from application to application.
The basic skeleton of the proposed multiprocessor network, named the
Linearly Extensible Cube (LEC), whose size grows linearly, has been
developed. Some of the properties observed for the LEC network, as compared
to other similar multiprocessor architectures, namely Hypercube, de Bruijn and
LET networks, are as follows:
1) In LEC network, the number of nodes at a level n is $N = 2n$ (for n > 0),
whereas the number of nodes in LET network is $N = \sum_{k=1}^{n} k$. In
Hypercube and in de Bruijn the number of nodes is equal to $2^n$. Thus, LEC
network is more economical than the other networks.
2) The degree of a node in the proposed model is always 4. The
connectivity of the Hypercube is equal to the number of dimensions in
the cube, while in case of de Bruijn or LET it is 4 or less.
3) In LEC network, the complexity of extension is constant: each
extension requires the addition of only two nodes, whereas in LET
network the extension complexity increases linearly because each
extension requires adding a single layer of n+1 nodes. Hypercube and
de Bruijn networks are extensible, but the complexity increases
exponentially, in powers of 2.
4) The bisection width of LEC is directly proportional to the number of
nodes in the network at the particular level, so with each extension of
level the bisection width also increases. Hypercube and de Bruijn
have high bisection width, but at the expense of a larger number of
edges per node. In LEC network, the diameter does not always increase
with the addition of a layer of nodes.
5) The proposed LEC network has better fault tolerance. It has a meshed
topology, and hence any single faulty link or faulty node can be
bypassed with only two additional hops.
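The growth rates compared in properties 1 and 3 can be checked with a few lines of C. This is an illustrative sketch: the function names are hypothetical, and the LET count assumes that the sum runs from k = 1 to n, which is consistent with each LET extension adding a layer of n+1 nodes.

```c
/* Number of nodes at level n for each network. */
long lec_nodes(int n)       { return 2L * n; }               /* N = 2n */
long let_nodes(int n)       { return (long)n * (n + 1) / 2; } /* ASSUMED: sum of k, k = 1..n */
long hypercube_nodes(int n) { return 1L << n; }               /* N = 2^n (also de Bruijn) */

/* Nodes that must be added to go from level n to level n + 1. */
long lec_extension(int n)       { return lec_nodes(n + 1) - lec_nodes(n); }
long let_extension(int n)       { return let_nodes(n + 1) - let_nodes(n); }
long hypercube_extension(int n) { return hypercube_nodes(n + 1) - hypercube_nodes(n); }
```

At level 3, for example, LEC has 6 nodes and needs 2 more to extend, LET under the stated assumption has 6 nodes and needs 4 more, and a Hypercube has 8 nodes and needs 8 more.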
To check for a good organizational model, a dynamic scheduling
scheme known as Two Round Scheduling (TRS) has been proposed and
implemented on LEC. The performance of the LEC network with the proposed
algorithm is evaluated for uniform and non-uniform loads. In this direction,
investigations have been made through simulation studies in the same
environment. The behavior of the scheduling scheme is evaluated in terms of
the performance index called Load Imbalance Factor (LIF), which represents
the difference between the maximum load on a processor and IL, relative to
IL, and in terms of the balancing time needed to reach the desired LIF.
The proposed scheduling scheme is also implemented on other
multiprocessor networks, namely Hypercube, de Bruijn and LET, in the same
environment. Simulation studies show that the proposed TRS scheme performs
better on LEC network for non-uniform load and equally well for uniform load
as compared to the other networks. The versatility of a multiprocessor system
depends upon its interconnection network and an appropriate scheduling
strategy. In order to confirm the performance of the LEC network along with
the proposed scheduling scheme, several other dynamic scheduling schemes, such
as Minimum Distance Scheduling (MDS), Hierarchical Balancing Method
(HBM) and Gradient Model (GM), are also implemented on LEC to check the
superiority of the proposed TRS scheme. Simulation results show that the
proposed scheduling scheme performs better for both uniform and
non-uniform loads on the proposed network, showing that the TRS scheme
together with the LEC network is a good organizational model.
To check the effectiveness of LEC when used as a server, it is tested for
both handling unpredictable communication traffic by utilizing all the
nodes of the network, and servicing/processing a number of query requests
efficiently (i.e. in minimum time). In the first case the proposed TRS scheme
is modified in such a way that different values of LIF are fixed and the time
taken to balance the load is evaluated. The curves plotted from the simulation
studies indicate the average behavior of the time taken with respect to the
various values of average LIF under different load conditions. It has been
observed that the average time decreases smoothly as the value of LIF
increases. In general, the performance results show that a better result can
be obtained by setting the LIF between 30 and 50%. Another parameter which
affects the performance of the server is the size of the data block that is
the candidate for migration from one node to another. To consider the effect
of block size, simulation results are obtained for different block sizes. It
has been found that the server performance can be improved by increasing the
block size for a particular fixed value of LIF. Since the algorithm is
dynamic, it decides the block size itself, depending upon the
communication traffic (requests) it receives. Therefore, it may be concluded
that when the proposed load balancing algorithm is applied to the LEC, it gives
better performance for unpredictable, bursty and rapid communication traffic.
The second algorithm, developed for managing the information exchange,
is applied on LEC. A number of search queries have been examined and the
average time taken to process different sets of queries is computed. Simulation
results are compared with a Multi-Process (MP) type uni-processor system. It is
observed that the LEC shows better exchange of information with the proposed
information retrieval algorithm.
Comparisons of the LEC multiprocessor network and its inherent
qualities, along with its smaller number of processors, reveal that this
network is reasonably comparable with, and more economical than, the existing
multiprocessor networks. From the simulation studies, it has been found that
the proposed dynamic scheduling scheme performs better on LEC network in
comparison to other dynamic scheduling schemes. Therefore, it may be
concluded that the proposed architecture with the proposed schemes
performs better for unpredictable and bursty data. Hence, due to its various
characteristics and performance parameters, the LEC architecture could be used
as a server in networking. When the LEC server is compared with a
uni-processor server for the retrieval of information, with the proposed
algorithm applied to both servers in the same environment, it is found that
the LEC type server reduces the resource download time.
7.2 Future Work 
The following extensions to the work presented in the thesis are
recommended.
1) The LEC network and the scheduler may be implemented for real time
problems, and actual program traces may be obtained for evaluating the
performance of the network and the scheduling scheme.
2) The VLSI layout of the LEC network can be studied and designed using an
FPGA (Field Programmable Gate Array) for online implementation.
3) Signal processing systems have become very complex; in particular, the
streaming nature of some signal processing algorithms means that they
cannot be run on a single node platform. These requirements lead to
multiprocessor systems on chip (MPSoCs). With the design of
appropriate software, the proposed LEC network could be used as a
platform to map signal processing applications.
4) By applying the LEC network in the design of grid computing and
cloud computing systems, it could be used for various applications.
REFERENCES 
[Abdel and Khaled, 1998] Abdel, A. and Khaled, D. (1998). The Hyperstar 
Interconnection Network. Journal of Parallel and Distributed Computing,
number 48, pages 175-199. 
[Akl, 1997] Akl, S. G. (1997). Parallel Computation: Models and Methods. 
Prentice Hall. 
[Anand et al., 1999] Anand, A., Ghose, D., and Mani, V. (1999). ELISA: An
Estimated Load Information Scheduling Algorithm for Distributed
Computing System. Computers and Mathematics with Applications, volume 37,
pages 57-85.
[Apache, 2003] The Apache HTTP Project (2003). The Apache Software
Foundation. http://httpd.apache.org.
[Attiya, 2004] Attiya, H. (2004). Two phase Algorithm for Load Balancing in
Heterogeneous Distributed Systems. In Proceedings of 12th Euromicro
Conference on Parallel, Distributed and Network-Based Processing
(Euro-PDP'04), pages 434-439.
[Bahi et al., 2003] Bahi, J. M., Couturier, R., and Vernier, F. (2003). Broken
Edges and Dimension Exchange Algorithm on Hypercube Topology. In
Proceedings of the 11th Euromicro Conference on Parallel, Distributed and
Network-Based Processing (Euro-PDP'03), pages 140-145.
[Bahi et al., 2005] Bahi, J. M., Couturier, R., and Vernier, F. (2005). 
Synchronous Distributed Load Balancing on Dynamic Network. Journal 
of Parallel and Distributed Computing, volume 16, number 11, pages 
1397-1405. 
[Baker, 2005] Baker, T. P. (2005). An analysis of EDF schedulability on a 
multiprocessor. IEEE Transaction on Parallel and Distributed Systems, 
volume 16, number 8, pages 760-768. 
[Banicescu and Velusamy, 2002] Banicescu, I. and Velusamy, V. (2002). Load 
Balancing Highly Irregular Computations with the Adaptive Factoring. In
Proceedings of 16th International Parallel and Distributed Processing
Symposium (IPDPS'02), pages 87-98.
[Barroso et al., 2003] Barroso, L. A., Dean, J., and Hölzle, U. (2003). Web
search for a planet: The Google Cluster Architecture. IEEE Micro,
volume 23, number 2, pages 22-28.
[Beaumont et al., 2008] Beaumont, O., Carter, L., Ferrante, J., Legrand, A., 
Marchal, L., and Robert, Y. (2008). Centralized versus Distributed 
Schedulers for Bag-of Tasks Applications. IEEE Transaction on Parallel 
and Distributed Systems, volume 19, number 5, pages 698-709.
[Bei et al., 2002] Bei, Z., Zhaongmeng, Z., Lulin, Z., Liping, W. (2002). 
Building a specialized search engine of special subject. In proceedings of 
IEEE TENCON, pages 69-72. 
[Bell and Gray, 2003] Bell, G. and Gray, J. (2003). What is next in High-
Performance Computing? Communication of the ACM. Volume 45, 
number 2, pages 22-28. 
[Benini and Micheli, 2002] Benini, L. and Micheli, G. D. (2002). Networks on 
Chips: A new SoC paradigm. IEEE Transaction on Computer, volume 35,
number 1, pages 70-78. 
[Bertogna et al., 2009] Bertogna, M., Cirinei, M., and Lipari, G. (2009).
Schedulability analysis of Global scheduling algorithm on multiprocessor 
platforms. IEEE Transaction on Parallel and Distributed Systems, volume 
20, number 4, pages 553-566. 
[Bhuyan and Agrawal, 1984] Bhuyan, L. N. and Agrawal, D. P. (1984). 
Centralized hypercube and hypercubes structure. IEEE Transaction on 
Computer, volume 33, number 4, pages 323-333. 
[Boeres et al., 2003] Boeres, C., Lima, A., and Rebello, V. E. F. (2003). Hybrid
Task Scheduling: Integrated Static and Dynamic Heuristics. In
Proceedings of 15th Symposium on Computer Architecture and High
Performance Computing, pages 199-206.
[Cardellini et al., 2001] Cardellini, V., Casalicchio, E., Colajanni, M., and
Mambelli, M. (2001). Web switch support for Differentiated services. 
ACM Performance Evaluation Rev., volume 29, pages 14-19. 
[Cardellini et al., 2002] Cardellini, V., Casalicchio, E., Colajanni, M., and Yu, 
P. S. (2002). The State of the Arts in Locally Distributed Web Server 
Systems. ACM computing Surveys, volume 34, number 2, pages 1-49. 
[Chandra and Shenoy, 2008] Chandra, A. and Shenoy, P. (2008). Hierarchical 
Scheduling for Symmetric Multiprocessors. IEEE Transaction on Parallel 
and Distributed Systems, volume 19, number 3, pages 418-431. 
[Chang and Chen, 2006] Chang, R. W. and Chen, S. G. H. (2006).
Node-disjoint paths in hierarchical hypercube networks. In Proceedings of 20th
International Parallel and Distributed Processing Symposium, pages 25-29.
[Chang et al., 2008] Chang, R. S., Guo, M. H., and Lin, H. C. (2008). A 
multiple parallel download scheme with server throughput and client 
bandwidth considerations for data grid. Future Generation Computer 
Systems, volume 24, pages 798-805. 
[Chao and Li, 2004] Chao, C. H. and Li, J. S. (2004). A novel BIB based 
Parallel Downloading scheme. In Proceedings of IEEE Asia-Pacific 
Conference on Circuit and Systems, pages 461-464. 
[Chen et al., 2005] Chen, H., Gong, Z., and Huang, Z. (2005). Parallel 
downloading algorithm for large-volume files distribution. In Proceedings 
of 6th International Conference on Parallel and Distributed Computing,
Application and Technologies, pages 745-749. 
[Choi et al., 2005] Choi, G. S., Kim, J. H., Ersoz, D., and Das, R. (2005). A 
Multi-Threaded PIPELINED Web Server Architecture for SMP/SoC 
Machines. In Proceedings of International World Wide Web Conference, 
Japan, pages 730-739. 
[Ciardo et al., 2001] Ciardo, G., Riska, A., and Smirni, E. (2001). EquiLoad: A
Load Balancing Policy for Clustered web servers. Performance 
Evaluation. Journal of Parallel and Distributed Computing, volume 46, 
pages 101-124. 
[Corbalan et al., 2005] Corbalan, J., Martorell, X., and Labarta, J. (2005).
Performance-Driven Processor Allocation. IEEE Transaction on Parallel 
and Distributed Systems, volume 16, number 7, pages 599-610. 
[Cortes et al., 1999] Cortes, A., Ripoll, A., Senar, M. A., and Luque, E. (1999). 
Performance Comparison of Dynamic Load-balancing Strategies for 
Distributed Systems. IEEE Proceedings of the 32nd Hawaii International
Conference on System Sciences, volume 8, pages 8041-8051.
[Cybenko, 1989] Cybenko, G. (1989). Dynamic load balancing for distributed 
memory multiprocessors. Journal of Parallel and Distributed Computing, 
volume 7, pages 279-301. 
[Darbha and Agrawal, 1998] Darbha, S. and Agrawal, D. P. (1998). An 
Optimal Scheduling Algorithm for Distributed memory Machines. IEEE 
Transaction on Parallel and Distributed Systems, volume 9, number 1, 
pages 87-95. 
[Day and Tripathi, 1994] Day, K. and Tripathi, K. (1994). A Comparative 
Study of Topological Properties of Hypercube and star Graphs. IEEE 
Transaction on Parallel and Distributed Systems, volume 5, number 1, 
pages 31-38. 
[Dobber et al., 2009] Dobber, M., Mei, R. V. D., and Koole, G. (2009).
Dynamic Load Balancing and Job Replication in a Global-Scale Grid 
Environment: A Comparison. IEEE Transaction on Parallel and 
Distributed Systems, volume 20, number 2, pages 207-218. 
[Dobber et al., 2004] Dobber, A. M., Koole, G. M., and Van der Mei, R. D. 
(2004). Dynamic Load Balancing for Grid Application. In Proceedings of 
International Conference on High performance Computing (HiPC'04), 
pages 342-352. 
[Dobber et al., 2005] Dobber, A. M., Koole, G. M., and Van der Mei, R. D. 
(2005). Dynamic Load Balancing Experiments in a Grid. In Proceedings 
of 5th IEEE International Symposium on Cluster Computing and the Grid
(CCGrid'05), pages 123-130. 
[Doulamis et al., 2007] Doulamis, N. D., Doulamis, A. D., Varvarigos, E. A., 
and Varvarigou, T. A. (2007). Fair Scheduling Algorithms in Grids. IEEE 
Transaction on Parallel and Distributed Systems, volume 18, number 11, 
pages 1630-1648. 
[Dykes et al., 2000] Dykes, S. G., Robbins, K. A., and Jeffery, C. L. (2000). An 
empirical evaluation of client-server selection algorithms. In Proceedings 
of 19th Annual Joint Conference of IEEE Computer and Communication
societies, volume 3, pages 1361-1370. 
[Edenfeld et al., 2004] Edenfeld, D., Kahng, A. B., Rodgers, M., and Zorian, Y. 
(2004). Technology roadmap for semiconductors. IEEE Transaction on 
Computer, volume 37, number 1, pages 77-56. 
[Esfahanian and Sagan, 1993] Esfahanian, L. M. Ni. and Sagan, B. E. (1993). 
The Twisted N-Cube with application to multiprocessing. IEEE 
Transaction on Computer, volume 40, number 4, pages 88-93. 
[Foglia et al., 2000] Foglia, P., Giorgi, R., and Prete, C.A. (2000). 
Performance Analysis of Electronic Commerce Multiprocessor System. In 
Proceedings of 33rd Hawaii International Conference on System Sciences,
pages 1-9. 
[Funasaka et al., 2003] Funasaka, J., Nakawaki, N., and Ishida, K. (2003). A 
parallel downloading method of coping with variable band width. In 
Proceedings of 23rd International Conference on Distributed Computing
System Workshops, pages 14-19. 
[Funasaka et al., 2005] Funasaka, J., Kawano, A., and Ishida, K. (2005). 
Implementation Issues of Parallel Downloading Methods for a Proxy 
System. In Proceedings of 25th IEEE International Conference on
Distributed Computing Systems Workshop (ICDCSW'05), volume 1, 
pages 58-64. 
[Ganeshan and Pradhan, 1993] Ganeshan, E. and Pradhan, D. K. (1993). The 
hyper-deBruijn network: scalable versatile architecture. IEEE Transaction
on Parallel and Distributed Systems, volume 4, number 9, pages 962-978.
[Ghose and Desai, 1995] Ghose, K. and Desai, K. R. (1995). Hierarchical 
Cubic Network. IEEE Transaction on Parallel and Distributed Systems, 
volume 6, number 4, pages 427-435. 
[Gkantsidis et al., 2003] Gkantsidis, C., Ammar, M., and Zegura, E. (2003). On
the Effect of Large-Scale Development of Parallel Downloading. In
Proceedings of 3rd IEEE Workshop on Internet Applications (WIAPP 03),
pages 79-89. 
[Gomez et al., 2006] Gomez, M. E., Nordbotten, N. A., Flich, J., Lopez, P., 
Robles, A., Duato, J., Skeie, T., and Lysne, O. (2006). A Routing 
Methodology for Achieving Fault Tolerance in Direct Networks. IEEE 
Transactions on Computers, volume 55, number 4, pages 400-415. 
[Grama et al., 2003] Grama, A., Gupta, A., and Karypis, G. (2003). Introduction
to Parallel Computing (Second Edition) [M]. Addison-Wesley Press.
[Grindley et al., 2000] Grindley, R., Abdelrahman, T., Brown, S., Caranci, S.,
et al. (2000). The NUMAchine multiprocessor. In Proceedings of
International Conference on Parallel Processing, pages 487-496. 
[Grosu and Chronopouls, 2002] Grosu, D. and Chronopouls, A. T. (2002). A 
Game-Theoretic Model and Algorithm for Load Balancing in Distributed 
Systems. In Proceedings of 16th International Parallel and Distributed
Symposium (IPDPS'02), volume 2, pages 146-153. 
[Hamid and Hall, 1994] Hamdi, I. and Hall, R. W. (1994). An efficient class of 
interconnection networks for parallel computation. The Computer Journal,
volume 37, number 3, pages 206-218.
[He et al., 2007] He, B., Patel, M., Zhang, Z., and Chang, K. C. (2007). 
Accessing the deep Web. Communication of the ACM, volume 50, 
number 5, pages 94-101. 
[Hesham and Mustfa, 2005] Hesham, E. R. and Mustfa, A. (2005). Advanced 
Computer Architecture and Parallel Processing. John Wiley & Sons, Inc., 
Hoboken, New Jersey.
[Higashi et al., 2004] Higashi, Y., Ata, S., Oka, I., and Fujiwara, C. (2004). 
Topology-aware Server selection method for Dynamic Parallel 
Downloading. In proceedings of the Consumer Communications and 
Networking Conference (CCNS), pages 325-330. 
[Hortron, 1993] Hortron, G. (1993). A Multi-Level Diffusion Method for 
Dynamic Load Balancing. Journal of Parallel Computing, volume 19, 
pages 209-229. 
[Houle et al., 2002] Houle, M., Symnovis, A., and Wood, D. (2002).
Dimension-exchange algorithms for load balancing on trees. In 
proceedings of international Colloquium on Structural Information and 
Communication Complexity, Greece, pages 181-196. 
[Hsu et al., 1996] Hsu, W. J., Chung, M. J., and Hu, Z. (1996). Gaussian 
Networks for Scalable Distributed Systems. The Computer journal, 
volume 39, number 5, pages 417-426. 
[Hsu et al., 2000] Hsu, T., Lee, J. C., Lopez, D. R., and Royce, W. A. (2000).
Task Allocation on a network of Processors. IEEE Transaction on 
Computers, volume 49, number 12, pages 1339-1352. 
[Hwang and Briggs, 1985] Hwang, K. and Briggs, F. A. (1985). Computer 
Architecture and Parallel Processing. McGraw-Hill International Edition. 
Singapore. 
[Hwang, 2001] Hwang, K. (2001). Advanced Computer Architecture.
Tata McGraw-Hill, New York. 
[Imani and Azad, 2007] Imani, N. and Azad, H. S. (2007). Perfect Load 
balancing on the star interconnection network. Journal of Supercomputer, 
volume 41, pages 269-286.
[Ishfaq and Ghafoor, 1991] Ishfaq, A. and Ghafoor, A. (1991).
Semi-Distributed Load Balancing For Massively Parallel Multicomputer
Systems. IEEE Transaction on Software Engineering, volume 17, number 
10, pages 987-1004. 
[Jin et al., 2002] Jin, S., Bestavros, A., and Iyengar, A. (2002). Accelerating 
internet streaming media delivery using network aware partial caching. In 
Proceedings of the 22nd IEEE International Conference on Distributed
system, pages 153-160.
[Karrer and Knightly, 2005] Karrer, R. P. and Knightly, E. W. (2005).
TCP-PARIS: A parallel download protocol for replicas. In Proceedings of 10th
International Workshop on Web contents Caching and Distribution, pages 
15-25. 
[Kim and Veidenbaum, 1999] Kim, S. and Veidenbaum, A. V. (1999). 
Interconnection network organization and its impact on performance and 
cost in shared memory multiprocessors. Journal of Parallel Computing,
volume 25, pages 283-309. 
[Koo et al., 2003] Koo, S. G. M., Rosenberg, C., and Xu, D. (2003). Analysis of
parallel downloading for large file distribution. In Proceedings of 9th 
IEEE Workshop on future trends of distributed computing systems 
(FTDCS'03), pages 128-136. 
[Kumar and Patnaik, 1992] Kumar, M. and Patnaik, L. M. (1992). Extended 
Hypercube: A hierarchical interconnection network of hypercube. IEEE 
Transaction on Parallel and Distributed Systems, volume 3, number 1, 
pages 45-57. 
[Kumar et al., 1991] Kumar, V., Ananth, G. Y., and Rao, N. V. (1991).
Scalable load balancing techniques for parallel computers. Technical 
Report, Department Of Computer Science, University of Minnesota, 
USA, pages 91-55. 
[Kwai and Parhami, 2004] Kwai, D. M. and Parhami, B. (2004). Incomplete K-
ary n-cube and its Derivatives. Journal of Parallel and Distributed 
Computing, volume 16, number 2, pages 183-190.
[Kwak and Jhon, 2005] Kwak, J. W. and Jhon, C. K. (2005). Performance
Evaluation of Modified hierarchical Ring by Exploiting Link utilization
and Memory Access Locality. IEEE International Symposium on signal
processing and Information Technology, pages 82-87.
[Kwak and Jhon, 2007] Kwak, J. W. and Jhon, C. S. (2007). Torus Ring: 
improving performance of interconnection network by modifying 
hierarchical ring. Journal of Parallel Computing, volume 33, pages 2-20.
[Latifi and Amawy, 1991] Latifi, S. and Amawy, A. (1991). Properties and 
Performance of Folded Hypercube. IEEE Transaction on Parallel and 
Distributed Systems, volume 2, number 1, pages 31-42. 
[Lee and Zomaya, 2008] Lee, Y. C. and Zomaya, A. Y. (2008). A Novel State
Transition method for Metaheuristic-Based Scheduling in Heterogeneous
Computing Systems. IEEE Transaction on Parallel and Distributed 
Systems, volume 19, number 9, pages 1215-1223. 
[Leighton et al., 1992] Leighton, F. T. (1992). Introduction to Parallel 
Algorithms and Architectures: Arrays, Trees and Hypercubes, Morgan
Kaufmann. 
[LeMair and Reeves, 1990] LeMair, M. H. W. and Reeves, A. P. (1990). Local 
versus Global Strategies for Dynamic Load Balancing. In Proceedings of
International Conference on Parallel Processing, volume 1, pages 569-
570. 
[LeMair and Reeves, 1993] LeMair, M. H. W. and Reeves, A. P. (1993).
Strategies for dynamic load balancing on highly parallel computers. IEEE 
Transaction on Parallel and Distributed Systems, volume 4, number 9, 
pages 979-92. 
[Leung and Li, 2006] Leung, K. C. L. and Li, V. O. K. (2006). A Paracasting 
Model for Concurrent Access to Replicated Internet Content. IEEE 
Transaction on Multimedia, volume 8, number 1, pages 90-100. 
[Li and Kamede, 1998] Li, J. and Kamede, H. (1998). Load balancing problems 
for multi-class Jobs in Distributed/Parallel Computer Systems, IEEE 
Transaction on Computers, volume 47, number 3, pages 322-332. 
[Lin and Raghavendra, 1992] Lin, H. C. and Raghavendra, C. S. (1992). A 
Dynamic Load balancing Policy with a control Job dispatcher (LBC). 
IEEE Transaction on Software Engineering, volume 18, number 2, pages 
148-158. 
[Louri and Sung, 1994] Louri, A. and Sung, H. (1994). Scalable Optical
hypercube-Based Interconnection Network for massively parallel 
computing. Applied Optics, volume 33, pages 7588-7598. 
[Machida et al., 2008] Machida, Y., Takizawa, S. I., Nakada, H., and 
Matsuoka, S. (2008). Intelligent data staging with overlapped execution of 
grid applications. Future Generation Computer System, volume 24, pages 
425-433. 
[Malluhi and Bayoumi, 1994] Malluhi, Q. M. and Bayoumi, M. A. (1994). The 
Hierarchical Hypercube: A New Interconnection Topology for Massively 
Parallel Systems, IEEE Transaction on Parallel and Distributed Systems, 
volume 5, number 1, pages 17-30. 
[Mano, 2003] Mano, M. M. (2003). Computer System Architecture (3rd
edition). Prentice-Hall of India Private Limited, New Delhi.
[Meraij et al., 2007] Meraij, S., Nayebi, A., and Azad, H. S. (2007). Empirical
Performance Evaluation of Adaptive Routing in necklace Hypercube: A
Comparative Study. In Proceedings of the IEEE International Conference
on Computing (ICCTA'07), pages 193-197.
[Monemizadeh and Sarbazi-Azad, 2005] Monemizadeh, M. and Sarbazi-Azad,
H. (2005). The necklace hypercube: A well scalable hypercube-based 
interconnection network for multiprocessors. ACM SAC, pages 729-733. 
[Narayan and Opartny, 1999] Narayan, L. and Opartny, J. (1999). Compact 
Routing on Chordal Rings of Degree 4. Algorithmica, volume 23, pages 
72-96. 
[Nehra et al., 2007] Nehra, N., Patel, R. B., and Bhat, V. K. (2007). A 
Framework for Distributed Dynamic Load Balancing in Heterogeneous 
Cluster. Journal of Computer Science, volume 3, number 1, pages 14-24. 
[Neocleous et al., 1998] Neocleous, C., Weech, B., and Louri, A. (1998). A
Spanning multi-channel linked hypercube: a gradually scalable optical 
interconnection network for massively parallel computing. IEEE 
Transaction on Parallel and Distributed Systems, volume 9, number 5, 
pages 497-512. 
[Ng et al., 2003] Ng, T. S. E., Chu, Y. H., Rao, S. G., Sripanidkulchai, K., and
Zhang, H. (2003). Measurement-based peer-to-peer systems. In 
Proceedings of IEEE INFOCOM 2003, San Francisco, CA, volume 3, 
pages 2199-2209. 
[Pai et al., 1999] Pai, V. S., Druschel, P., and Zwaenepoel, W. (1999). Flash: An
Efficient and Portable Web Server. In Proceedings of the USENIX 99 
Annual Technical conference. 
[Parhami and Kwai, 1999] Parhami, B. and Kwai, D. M. (1999). Periodically 
Regular Chordal Rings. IEEE Transaction on Parallel and Distributed 
Systems, volume 10, number 6, pages 658-672. 
[Parhami, 2000] Parhami, B. (2000). Challenges in interconnection network 
design in the era of multiprocessor and massively parallel Microchips. In 
Proceedings of International Conference of communication in computing, 
pages 241-246. 
[Park and Choe, 2002] Park, C. I. and Choe, T. Y. (2002). An Optimal 
Scheduling Algorithm based on Task Duplication. IEEE Transaction on 
Computer, volume 51, number 4, pages 444-448. 
[Patel et al., 2000] Patel, A., Kasalik A., and McCrosky, C. (2000). Area-
Efficient VLSI Layout for binary Hypercube. IEEE Transaction on 
Computers, volume 49, number 2, pages 160-169. 
[Peter et al., 2005] Peter, K. K., Hsu, W. J., and Pan, Y. (2005). The Exchanged 
Hypercube. IEEE Transaction on Parallel and Distributed Systems, 
volume 16, number 9, pages 866-874. 
[Petkov, 1992] Petkov, N. (1992). Systolic Parallel Processing. North Holland 
Publishing Company. 
[Philopoulos and Maheswaran, 2001] Philopoulos, S. and Maheswaran, M. 
(2001). Experimental study of Parallel Downloading Schemes for Internet 
Mirror Sites. In Proceedings of 13th IASTED International Conference on
Parallel and Distributed System (PDCS 01), pages 44-48. 
[Qin et al., 2003] Qin, X., Jiang, H., Zhu, Y., and Swanson, D. (2003). A
dynamic load balancing scheme for I/O intensive applications in
distributed systems. In Proceedings of the 32nd International Conference
on Parallel Processing Workshop (ICPPW'03), pages 79-86. 
[Quinn, 2002] Quinn, M. J. (2002). Parallel Computing: Theory and Practices.
Tata McGraw-Hill Publishing Company Limited (Second Edition), New 
Delhi. 
[Rafiq et al., 1999] Rafiq, M. Q., Padam, K., and Gupta, J. P. (1999). A Novel 
Tree-Structured Multiprocessor Network. In Proceedings of International 
Conference on Robotics Vision and Parallel Processing for
Automation, Malaysia, volume 2, pages 576-585. 
[Rafiq, 1995] Rafiq, M. Q. (1995). Studies on the performance evaluation of a
Linearly Extensible Multiprocessor network. Ph.D thesis, University of 
Roorkee (IIT), Roorkee, India. 
[Rafiquzzaman and Chandra, 1997] Rafiquzzaman, M. and Chandra, N. (1992). 
Modern Computer Architecture. Galgotia Pvt. Ltd, New Delhi, India.
[Ranjan and Knightly, 2008] Ranjan, S. and Knightly, E. (2008). High-
Performance Resource allocation and Request Redirection Algorithm for 
Web Clusters. IEEE Transaction on Parallel and Distributed Systems, 
volume 19, number 9, pages 1186-1200. 
[Ranjan et al., 2004] Ranjan, S., Karrer, R., and Knightly, E. (2004). Wide area 
redirection of dynamic content by Internet Data Centers. In Proceedings of
IEEE INFOCOM 2004, Hong Kong, China, volume 2, pages 816-826. 
[Ravikanth et al., 1988] Ravikanth, K., Sastry, P. S., and Ramakrishanan, K. R. 
(1988). A reduction architecture for the optimal scheduling of binary 
trees. Future Generation Computer Systems, volume 4, pages 225-233. 
[Ravikanth et al., 1990] Ravikanth, K., Sastry, P. S., and Venkatesh, Y. U. 
(1990). Simulation studies on the performance on an organizational model 
for graph reduction. Future Generation Computer Systems, volume 6, 
pages 163-180. 
[Reddy, 1993] Reddy, A. V. (1993). Design of parallel reduction model for the 
implementation of functional languages. Ph.D Thesis, IIT, Roorkee. 
[Rodriguez and Biersack, 2002] Rodriguez, P. and Biersack, E. (2002). 
Dynamic Parallel Access to Replicated Content in the Internet. 
IEEE/ACM Transactions on Networking, volume 10, number 4, pages 
455-465. 
[Rodriguez et al., 2000] Rodriguez, P., Kirpal, A., and Biersack, E. (2000).
Parallel-access for mirror sites in the Internet. In proceedings of 
INFOCOM, volume 2, pages 864-873. 
[Saad and Shultz, 1988] Saad, Y. and Shultz, M. (1988). Topological properties 
of hypercubes. IEEE Transaction on Computer, volume 37, number 7, 
pages 867-871. 
[Salleh et al., 2002] Salleh, S., Aziz, N. A. B., Azmee, N. A., and Mohamaed, N. H. 
(2002). Dynamic Multiprocessor Scheduling Model for the reconfigurable mesh 
computing network. Journal of Technology, volume 37, pages 55-66. 
[Samad and Rafiq, 2005] Samad, A. and Rafiq, M. Q. (2005). A Novel Server 
Architecture for networking. In proceedings of International Conference 
on Robotics, Vision, Information and Signal processing, (ROVISP2005), 
University Sains, Malaysia, pages 1029-1032. 
[Samad et al., 2008] Samad, A., Rafiq, M. Q., and Farooq, O. (2008). A Novel 
Algorithm for Fast Retrieval of Information from A Multiprocessor 
Server. In Proceedings of 7th WSEAS International Conference on
Software Engineering, Parallel and Distributed Systems (SEPADS '08), 
University of Cambridge, UK, pages 68-73. 
[Samad et al., 2009] Samad, A., Rafiq, M. Q., and Farooq, O. (2009). Effective 
Information Balancing on a Multiprocessor Server. In proceedings of 
IEEE International Advanced Computing Conference (IACC'09), Patiala, 
India, pages 1215-1219. 
[Samad et al., 2010] Samad, A., Rafiq, M. Q., and Farooq, O. (2010). LEC: An 
efficient scalable parallel interconnection network. Accepted for 
presentation in the International Conference on Emerging Trends in 
Computer Science, Communication and Information Technology 
(CSCIT2010), to be held at Nanded, India, from 09-11 Jan., 2010. 
[Samathan and Pradhan, 1989] Samathan, M. R. and Pradhan, D. K. (1989). 
The de Bruijn multiprocessor network: A versatile parallel processing and 
sorting network for VLSI. IEEE Transaction on Computers, volume 38, 
number 4, pages 561-581. 
[Sayal et al., 1998] Sayal, M., Breitbart, Y., Scheuermann, P., and Vingralek, 
R. (1998). Selection Algorithms for Replicated Web Servers. ACM 
SIGMETRICS Perform. Eval. Rev., volume 26, pages 44-50. 
[Sharma et al., 2008] Sharma, S., Singh, S., and Sharma, M. (2008). Performance
Analysis of Load Balancing Algorithms. In Proceedings of World 
Academy of Science, Engineering and Technology, volume 28, pages 
269-272. 
[Shen and Xu, 2009] Shen, H. and Xu, S. (2009). Coordinated En-Route Web 
Caching in Multiserver Networks. IEEE Transaction on Computers, 
volume 58, number 5, pages 605-619. 
[Shi and Srimani, 2005] Shi, W. and Srimani, P. K. (2005). Hierarchical star: a 
two level interconnection network. Journal of system Architecture, 
volume 51, pages 1-14. 
[Shi et al., 2002] Shi, W., Wright, R., Collins, E. and Karamcheti, V. (2002). 
Workload Characterization of a personalized Web site and its implications 
for dynamic content caching. In Proceedings of 7th International
Conference on Web Content Caching and Distribution (WCW'02).
[Shivaratri et al., 1992] Shivaratri, N.G., Krueger, P., and Singhal, M. (1992). 
Load Distribution for locally Distributed Systems. IEEE Transaction on 
Computer, volume 25, number 12, pages 33-44. 
[Stewart and Xiang, 2008] Stewart, I. A. and Xiang, Y. (2008). Embedding 
Long Paths in K-ary N-Cubes with Faulty Nodes and Links. IEEE 
Transaction on Parallel and Distributed Systems, volume 19, number 8, 
pages 1071-1085. 
[Subrata et al., 2008] Subrata, R., Zomaya, Y. A., and Landfeldt, B. (2008). A
Cooperative Game Framework for QOS Guided Job Allocation Schemes 
in Grids. IEEE Transaction on Computer, volume 57, number 10, pages 
1413-1422. 
[Towles and Dally, 2004] Towles, B. and Dally, W. (2004). Principles and 
Practices of Interconnection Network, Morgan Kaufmann Press, San 
Francisco. 
[Tse, 2009] Tse, S. S. H. (2009). Online Bicriteria Load balancing Using 
Object Reallocation. IEEE Transaction on Parallel and Distributed 
Systems, volume 20, number 3, pages 379-388. 
[Vuillemin and Preparata, 1981] Vuillemin, J. and Preparata, F. P. (1981). The 
cube-connected cycles: A new Interconnection Topology for parallel 
computation. Communication of the ACM, volume 24, number 5, pages 
300-309. 
[Watts and Taylor, 1998] Watts, J. and Taylor, S. (1998). A Practical Approach 
to Dynamic Load Balancing. IEEE Transaction on Parallel and Distributed
Systems, volume 9, number 3, pages 235-248. 
[Xi-Dao et al., 2000] Xi-Dao, L., Yu-Xiang, Xie, Ling-Da, Wu, Chi-Long, M., 
Song-Yang, L. (2000). Information Assistant: A Novel initiative topic 
search engine. In proceedings of 4th International Conference on Machine
learning and Cybernetics, Guangzhou, pages 2363-2367.
[Xu and Lau, 1992] Xu, C. Z. and Lau, F. C. M. (1992). Analysis of the 
Generalized Dimension Exchange Method for Dynamic Load balancing. 
Journal of Parallel and Distributed Computing, volume 16, pages 385-393.
[Xu and Lau, 1995] Xu, C. Z. and Lau, F. C. M. (1995). The Generalized 
Dimension Exchange Method for Dynamic Load balancing in K-ary
n-Cubes and variants. Journal of Parallel and Distributed Computing,
volume 24, pages 72-85. 
[Yagoubi and Slimani, 2006] Yagoubi, B. and Slimani, Y. (2006). Dynamic 
Load Balancing Strategy for Grid Computing. In Proceedings of World 
Academy of Science, Engineering and Technology, volume 13, pages 
260-265. 
[Yamin et al., 2001] Yamin, L., Peng, S., and Chu, W. (2001). Metacube: A
New Interconnection Network for Large Parallel System. ACSAC02,
Australian Computer Science Communications, volume 24, number 4,
pages 30-36.
[Yamin et al., 2004] Yamin, L., Peng, S., and Chu, W. (2004). Efficient
Collective Communications in Dual-cube. Journal of Super Computing, 
volume 28, pages 71-90. 
[Yang et al., 2007] Yang, J. S., Chang, J. M., Tang, S. M., and Wang, Y. L. 
(2007). Reducing the Height of Independent Spanning Trees in Chordal 
Rings. IEEE Transaction on Parallel and Distributed Systems, volume 18, 
number 5, pages 644-656. 
[Yao et al., 2008] Yao, J., Guo, J., and Bhuyan, L. (2008). Ordered Round 
Robin: An efficient Sequencing Preserving Packet Scheduler. IEEE 
Transaction on Computer, volume 57, number 12, pages 1690-1703. 
[Yoo et al., 2000] Yoo, D., Park, I., and Maeng, S. R. (2000). Multistage ring 
network: An interconnection network for large scale shared memory 
multiprocessors. Journal of Systems Architecture, volume 47, pages 765-
778. 
[Youyao et al., 2008] Youyao, L., Jungang, H., and Huimin, D. (2008). A 
Hypercube-based Scalable Interconnection Network for Massively 
Parallel Computing. Journal of Computers, volume 3, number 10, pages 
58-65.
[Zaki et al., 1997] Zaki, M. J., Wei Li, and Parthasarathy, S. (1997). 
Customized Dynamic Load Balancing for a Network of Workstations. 
Journal of Parallel and Distributed Computing, number 43, pages 156-
162. 
[Zegura et al., 2000] Zegura, E. W., Ammar, M. H., Fei, Z., and Bhattacharjee, 
S. (2000). Application-layer anycasting: A Server selection architecture
and use in replicated Web service. IEEE/ACM Transactions on
Networking, volume 8, number 4, pages 455-466. 
[Zeng and Veeravalli, 2004] Zeng, Z. and Bhardwaj, V. (2004). Design and
Analysis of a non Pre-emptive Decentralized Load Balancing Algorithm 
for Multi class Jobs in Distributed Networks. Computer Comm., volume 
27, pages 679-694. 
[Zeng and Veeravalli, 2006] Zeng, Z. and Veeravalli, B. (2006). Design and 
Performance Evaluation of Queue-and-Rate-Adjustment Dynamic Load 
Balancing Policies for Distributed Networks. IEEE Transaction on 
Computers, volume 55, number 11, pages 1410-1422. 
[Zeus, 2003] Zeus Web Server. (2003). Zeus Technology Limited,
http://www.zeus.com.
[Zhang, 2002] Zhang, Y. Q. (2002). Folded-crossed hypercube: a complete 
interconnection network. Journal of Systems Architecture, volume 47,
pages 917-922. 
[Zimmerman and Esfahanian, 1992] Zimmerman, G. W. and Esfahanian, A. H. 
(1992). Chordal Rings as Fault-Tolerant Loops. Discrete Applied
Mathematics, volume 37-38, pages 563-573.
List of papers from PhD Thesis 
International Conference Papers 
[1]. [Samad et al., 2010] Samad, A., Rafiq, M. Q., and Farooq, O. (2010). 
LEC: An efficient scalable parallel interconnection network. Accepted 
for presentation in the International Conference on Emerging Trends in 
Computer Science, Communication and Information Technology 
(CSCIT2010), to be held at Nanded, India, from 09-11 Jan., 2010.
[2]. [Samad et al., 2009] Samad, A., Rafiq, M. Q., and Farooq, O. (2009). 
Effective Information Balancing on a Multiprocessor Server. In 
proceedings of IEEE International Advanced Computing Conference 
(IACC'09), Patiala, India, pages 1215-1219. 
[3]. [Samad et al., 2008] Samad, A., Rafiq, M. Q., and Farooq, O. (2008). A 
Novel Algorithm for Fast Retrieval of Information from a 
Multiprocessor Server. In Proceedings of 7th WSEAS International
Conference on Software Engineering, Parallel and Distributed Systems 
(SEPADS '08), University of Cambridge, UK, pages 68-73. 
[4]. [Samad and Rafiq, 2005] Samad, A. and Rafiq, M. Q. (2005). A Novel 
Server Architecture for networking. In proceedings of International 
Conference on Robotics, Vision, Information and Signal processing, 
(ROVISP2005), University Sains, Malaysia, pages 1029-1032. 
