Interconnection of Clusters of Various Architectures in Grid Systems by Ovidiu GHERMAN et al.
 
Journal of Applied Computer Science & Mathematics, no. 12 (6) /2012, Suceava 
 
 
 
 
Interconnection of Clusters of Various Architectures in Grid Systems 
 
1Ovidiu GHERMAN, 
2Ioan UNGUREAN, 
3Ştefan G. PENTIUC 
University Stefan cel Mare of Suceava, Romania 
1ovidiug@usv.ro,
 2 ioanu@usv.ro,
 3 pentiuc@usv.ro 
 
Abstract - The future of computing seems to be parallel.
Computers built with general-purpose processors are nowadays
being superseded by systems built around multi-core
processors designed to perform massive numbers of arithmetic
operations, and these processors increasingly implement
internal parallelism. IBM took a similar step when it
proposed that a unit dedicated to arithmetic operations be
used as an accelerator node in a cluster controlled by nodes
with a general-purpose architecture. This is a new approach
to high performance computing (HPC), in the form of hybrid
architectures. However, using machines with different
architectures as a single system can pose problems in
application deployment. The paper analyses these difficulties
and proposes a procedure to implement multi-level parallelism
at the application level. The experimental results are
discussed.
 
Keywords: clusters, grid systems, PPE, SPE, SIMD 
 
I. INTRODUCTION 
 
The interconnection of parallel computing platforms in
Grid systems always raises problems related to the hardware
infrastructure that runs the distributed application software.
The large number of possible architectures and combinations
of computer systems or processor types has led to the
development of management applications able to run on
multiple architectures, so as to free the user from the need to
take into account the architecture and type of each node
on which the tasks will run. Although the nature of the
hardware nodes may influence the ability to interconnect
the resources, the use of middleware can help to standardize
access to them. One of the best-known and most widely used
such environments is the Globus Toolkit, a product of the
Globus Alliance, a community doing research on grid
technologies, standards and systems in order to create a
structure that enables distributed collaboration between
users in science, engineering, economics and other fields.
This middleware implements several well-defined
standards in the world of grid technology, such as OGSA,
OGSI and GSI. Its tools support the scheduling of tasks on
nodes, their distribution, the security of data and
applications, resource management, data transfer between
nodes, and more.
The research presented in this paper was carried out in the
High Performance Computing Laboratory of the University
Stefan cel Mare of Suceava. The main equipment of this
laboratory consists of two IBM BladeCenter clusters. The
first has a Roadrunner architecture with 96 PowerXCell 8i
processors, a measured computing power of 6.53 TFlops in
double precision (evaluated with Linpack by a team from IBM
Germany) and a storage capacity of 6.936 TB. The second is
based on Xeon microprocessors and provides 0.5 TFlops and
3 TB of storage capacity.
One of the main objectives of this laboratory is to create a
Grid environment offering a set of services that
encapsulate and virtualize the shared resources needed for
solving advanced problems in pattern recognition and
distributed computational intelligence.
 
II. HYBRID ARCHITECTURES 
 
In the design of a distributed application, a complicated
problem arises when a cluster has a hybrid architecture. An
example of such a platform is the USV-Roadrunner cluster
(Fig. 1), which is based on both x86 processors (CISC
architecture) and Cell BE processors, whose architecture is
similar to that of PowerPC processors (RISC). The structure
of such a system calls for a particular programming model.
An application designed for this system should take into
account the characteristics of both types of processors, and
should be designed to use the special capabilities of the
accelerator processors (PowerXCell 8i).
 
 
 
Fig. 1. The cluster USV-Roadrunner has 6 LS22 servers for tasks 
management, 48 QS22 for data processing, and a storage system.  
Computer Science Section 
 
 
 
The heterogeneous structure of the Cell BE processor consists
of nine internal cores [1] interconnected by a bus and DMA
controllers. This powerful computing unit is controlled by
one of the internal cores, called the PPE (PowerPC Processing
Element). The PPE is complemented by eight Synergistic
Processing Elements (SPEs), single-instruction multiple-data
(SIMD) processors optimized for data-intensive operations.
The PPE core is based on a 64-bit PowerPC architecture and
has, in addition to the original instruction set compatible
with processors of that architecture family, a specially
designed instruction set for vector calculations. This
extension is called the Vector/SIMD Multimedia Extension.
The PPE mostly deals with process management and
divides a problem into sub-problems that are easier to handle
in the arithmetic units. This core is the main processing
element, providing the interface between the eight
subordinate processors and the control node. It requires a
special compiler (for C/C++ applications, ppu-g++, installed
by default on this platform, can usually be used).
Tasks for the SPEs are allocated by the PPE, because the
SPEs cannot communicate directly with the outside. The SPE
has a modular design, the main module being the Synergistic
Processing Unit (SPU). Each SPU contains a RISC core with
256 KB of memory and an independent MFC (Memory
Flow Controller) unit, a DMA controller that uses a memory
management module to synchronize operations with
other SPUs on the same chip or with the PPU. The SPU loads
instructions and data from a local store [4]
and communicates with main memory, or with other local
stores resident on the same physical chip, via a dedicated
channel built into the MFC. The SPU has a new instruction
set architecture, designed by IBM Research, which uses 7-bit
operand fields to address 128 registers directly. To this end,
there is a unified SIMD register file with 128 entries of
128 bits each, used for both floating-point and integer
operations.
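The register-file figures quoted above follow from simple arithmetic. The short Python check below only illustrates those numbers; the variable names are ours and do not come from the IBM documentation.

```python
# Illustrative arithmetic only; all names here are hypothetical.
OPERAND_BITS = 7            # width of a register operand field (from the text)
REGISTER_WIDTH_BITS = 128   # width of one SIMD register (from the text)

# A 7-bit field can select one of 2**7 registers.
addressable_registers = 2 ** OPERAND_BITS

# Total size of the unified register file, in bytes.
register_file_bytes = addressable_registers * REGISTER_WIDTH_BITS // 8

# Number of values that fit in one register (SIMD lanes).
floats_per_register = REGISTER_WIDTH_BITS // 32    # single precision
doubles_per_register = REGISTER_WIDTH_BITS // 64   # double precision

print(addressable_registers, register_file_bytes,
      floats_per_register, doubles_per_register)
```

A 7-bit operand field is therefore exactly enough to name every entry of a 128-entry register file.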
 
Fig. 2. Programming technologies for a hybrid architecture [3]. 
The SPE supports a special set of novel SIMD instructions
specifically designed to accelerate floating-point arithmetic
calculations and asynchronous DMA transfers. These
transfers move data and instructions between main storage
(the effective address space that includes main memory) and
the local store.
The SPEs are not designed to run an operating system, so
they depend on the PPE [2]. Their advantage is their
performance in accelerating vector calculations. To benefit
from the accelerators, the user application must be optimized
for these units.
In general, a complex computing problem to be processed on
a Cell BE is divided into several less complex sub-problems,
which are sent to be solved on the eight synergistic cores
(SPEs). The vector processors do not have much local memory
(256 KB available) but work at a high frequency (3.2 GHz)
and receive data through DMA memory transfers. Because of
this way of working, the Cell BE processor performs
outstandingly when processing small data blocks. It is
therefore most effective when the primary application
(running on the x86 AMD Opteron processor that controls the
Cell BE) needs to carry out a series of fine-grained
computations quickly. Moreover, it is recommended that the
calculations involve data structures that can be processed as
vectors (e.g. matrices processed line by line). This
"offloading" technique requires, at the level of the SPE and
PPE cores, the use of a Cell BE acceleration library (DACS
and ALF). At the level of the control units (Opteron), MPI
may be used for data transmission between programs (Fig. 2).
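As an illustration of what "small data blocks" means in practice, the following Python sketch estimates how many matrix rows of double-precision values fit in one SPE local store and splits a matrix into row blocks accordingly. It is only a sizing aid under stated assumptions: the function names and the `budget_fraction` parameter (the share of the 256 KB assumed to be free for data rather than code, stack and buffers) are hypothetical and are not part of DACS or ALF.

```python
LOCAL_STORE_BYTES = 256 * 1024   # SPE local store size stated in the text
BYTES_PER_DOUBLE = 8

def rows_per_block(n_cols, budget_fraction=0.5):
    """Rows of n_cols doubles that fit in the assumed data budget.

    budget_fraction is a hypothetical share of the local store left
    for data; the rest is assumed to hold code, stack and buffers.
    """
    budget = int(LOCAL_STORE_BYTES * budget_fraction)
    row_bytes = n_cols * BYTES_PER_DOUBLE
    if row_bytes > budget:
        raise ValueError("a single row does not fit in the data budget")
    return budget // row_bytes

def split_rows(n_rows, n_cols, budget_fraction=0.5):
    """Return (start, stop) row ranges, each small enough for one SPE."""
    step = rows_per_block(n_cols, budget_fraction)
    return [(start, min(start + step, n_rows))
            for start in range(0, n_rows, step)]

# Example: a 40 x 1024 matrix of doubles, half of the local store for data.
print(split_rows(40, 1024))
```

Each resulting range could then be handed to one SPE as an independent unit of work, which matches the line-by-line processing recommended above.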
In essence, there are two levels of parallelization, of
different granularity, which complicates the writing of a
program designed to run on such an architecture. The code
sharing required between the two types of processors
implies partially rewriting the source code of many
existing applications. Besides this, the peculiarities of the
Cell BE processor (memory access, data structures, data
transfer mode, etc.) must be taken into account if
performance is to be maintained. When the hybrid system is
used in conjunction with other platforms, a new technique
for building the program is needed [5].
 
III. INTERCONNECTION OF VARIOUS ARCHITECTURES 
 
Adding new resources to a grid system may involve new
computing units with a classical architecture (x86, usually
AMD or Intel). Because the Cell BE processors can
communicate only with a master processor (x86),
from which they receive tasks, the user will have to
structure the application so that:
- the processes instantiated on the master nodes
receive tasks that do not require intensive vector
calculations, or tasks that require manipulation of data
structures (e.g. submitting tasks to the accelerator
processors, handling input and output, CRC checks,
receiving input data, decomposing the problem into parts
that can be worked on by individual SPE cores);
- the processes instantiated on the Cell BE processors
accelerate the computation.
Classical models of structuring parallel programs (master-slave
or cooperative) must be abandoned or redesigned in order to
function properly and to get maximum performance from the
accelerator processors.
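The division of work described in the list above can be sketched as a simple routing step. The Python fragment below is a minimal illustration, assuming that each task carries a label describing its kind; the task kinds, the `route` function and the target names "host" and "cell" are all hypothetical.

```python
from collections import defaultdict

# Hypothetical task kinds, following the examples given in the text.
HOST_KINDS = {"io", "crc", "decompose", "dispatch"}   # master (x86) work
ACCEL_KINDS = {"vector", "matrix"}                    # Cell BE work

def route(tasks):
    """Group (name, kind) pairs by the processor type that should run them."""
    plan = defaultdict(list)
    for name, kind in tasks:
        if kind in ACCEL_KINDS:
            plan["cell"].append(name)
        elif kind in HOST_KINDS:
            plan["host"].append(name)
        else:
            raise ValueError(f"unknown task kind: {kind}")
    return dict(plan)

print(route([("read_input", "io"),
             ("check_crc", "crc"),
             ("multiply_block", "matrix")]))
```

In a real application the "cell" tasks would still pass through the master processor, since the accelerators receive work only from it.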
The HPC Lab. of the University Stefan cel Mare of
Suceava has a 6.53 TFlops cluster called USV-Roadrunner.
The cluster's performance has been validated with the
Linpack test, and the reported power consumption was very
good. Its architecture is presented in Fig. 3. In terms of
functional structure, it is a hybrid cluster with nodes
optimized for the management of distributed applications.
These nodes (dual-CPU LS22) are not usually used for
intensive calculations, but for the distribution of tasks
and jobs.
The environment for distributed applications uses MPI (the
cluster is configured with Open MPI 1.3.3, installed on the
Red Hat Enterprise Linux 5 operating system). The
LS22 nodes communicate with the subordinate QS22
nodes (the acceleration nodes), each quad-core Opteron
processor managing three PowerXCell 8i processors. The
accelerators on one blade cannot communicate directly
with accelerators on neighbouring blades, but only
with the master processor, via a PCI Express bus. The
operating system may be either RHEL 5, when a
broad range of software tools for application programming on
hybrid computers is needed, or Fedora Core 9, an environment
highly optimized for computing performance but poor in
software development tools. The two operating systems
are interchangeable.
Since the acceleration processors cannot communicate
directly with other processors in the computing structure,
but only through the host processor, they depend on it to
obtain computing tasks. This technique essentially involves
parallelization within parallelization, and can be viewed
either from the side of the Cell unit ("Cell-centric") or
from that of the host processor ("host-centric") [3] [6].
 
 
 
Fig. 3. Hybrid structure of the cluster "USV-Roadrunner". 
On this cluster, software can be written in either of these
two ways. Because the management nodes usually do not
contribute to the computation effort, but only to the
management of the accelerator units, programs have to be
compiled for two targets only, not three.
In highly distributed grid applications, the host-centric
approach is preferable, as it provides more benefits than
the alternative. In this case, a user who submits a task
for processing and wishes to use both the accelerated
resources and the general resources (i.e. other clusters or
computing units connected to the grid) needs to adapt the
programs so that they can be compiled and run on all
available platforms. Such a procedure for programming a
distributed system may be considered difficult, but the real
problem lies in the development tools for the PowerPC
architecture. Testing to detect the architecture of a node is
relatively simple compared to the effort of writing source
code for a double parallelization. A solution to this problem
is therefore to decompose the work according to the
complexity of the calculations required.
For example, a problem involving large data sets that
require formatting and data processing can be handled in two
distinct phases:
a)  the primary format files or data are sent to the
general-purpose system processors;
b)  the calculation-intensive jobs are sent to the
accelerator processors; these jobs must be
sized according to the local store available in
each SPE in order to apply the necessary arithmetic
operations.
This can be an advantage if a parallel file system with
concurrent access is used (such as GPFS, installed on the
USV-Roadrunner cluster). In this case, several processors may
concurrently access different parts of the file, and then
signal the acceleration units to finish the processing.
To maintain a high processing speed, it is important to use
data structures suited to the pipelined processing units of
the Cell (e.g. linearly processed data that do not depend on
previous values). However, any problem also involves
management, organization and service delivery, for which
units with common architectures (e.g. x86 CISC) are better
and faster than accelerators. Although this style of
programming is still in its early days, the use of such units
(multicore vector processors) is growing rapidly. Because of
this, IBM is trying to extend its libraries (ALF, DACS) to
both hybrid and non-hybrid systems in order to avoid
compatibility problems. We may therefore see closer
competition with other distributed programming environments
such as MPI [3].
Because of the hybrid architecture, distributed
applications for heterogeneous nodes must be programmed
according to the nature of the platform on which the
application will run. For x86 processors, the compilers
and libraries are the standard ones that come with the MPI
install kit, whereas for the Cell processor two compilers
must be used, in order to generate code for both the PPU and
the SPU units.
 
Fig. 4. Interconnection of computing nodes (at the MPI level). 
 
For this reason, a wrapper is needed to execute the
more complex operations of copying and compiling in the
grid. Optimized code must also be written for the
acceleration units, which leads to much higher performance.
An experimental distributed system interconnects the
USV-Roadrunner cluster, which is based on 48 computing nodes
with EDP Cell BE processors, as mentioned above, and the
other cluster of the HPC Lab., which has 28 nodes based on
general-purpose Intel Xeon processors. Both clusters have
external connections (Gigabit Ethernet) and a management
server that allows the computing nodes to connect to the
external network [7]. The computing nodes are interconnected
and communicate at the application level via MPI messages
(Fig. 4). The platform of each node must be taken into
account in order to ensure the proper execution of the
application. This can become problematic for unknown systems
or heterogeneous resources (a discovery manager should be
used).
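Taking the platform of a node into account can start with a very small test. The Python sketch below, run on a given node, reads the machine type reported by the operating system and looks up which compilers would be needed there. The mapping is an assumption made for illustration: the machine strings "x86_64" and "ppc64" and the compiler names are examples, not a description of the actual USV configuration.

```python
import platform

# Hypothetical mapping from machine type to the compilers needed on it.
# An x86 node needs one MPI compiler wrapper; a Cell node needs two
# compilers, one for PPU code and one for SPU code.
TOOLCHAINS = {
    "x86_64": ("mpicc",),
    "ppc64": ("ppu-gcc", "spu-gcc"),
}

def compilers_for(machine=None):
    """Return the compilers assumed for a machine type.

    With no argument, the type of the current node is detected.
    An empty tuple means the architecture is not known to the mapping.
    """
    if machine is None:
        machine = platform.machine()
    return TOOLCHAINS.get(machine.lower(), ())

print(platform.machine(), compilers_for())
```

A discovery manager would collect this kind of information from every node before the compilation step is planned.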
In the experimental system, the communication network is
also a heterogeneous resource. The Xeon-based
cluster is connected internally and externally through a
Gigabit Ethernet network, while the USV-Roadrunner uses
Gigabit Ethernet for external connections and InfiniBand for
internal networking. This gives USV-Roadrunner higher data
transfer performance and better scalability when a larger
number of compute nodes is used.
A procedure that allows both types of systems to be used
as a whole can help the programmer to compile and
run any application more quickly and from a single source. A
script that implements such a procedure should be able to:
- transfer the source files to the remote computing units;
- compile the source files on the remote resources (using
the specific compilers);
- start the application on both resources,
specifying the nodes that will run its instances;
- centralize the results.
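The four steps listed above can be sketched as a function that only builds the command lines, without executing anything. This is a minimal sketch and not the script used in the experiments: the host names, the remote directory, the file names and the compiler chosen for each cluster are hypothetical, and it is assumed that a single mpirun invocation (using the colon-separated form for launching several programs) is able to reach the nodes of both clusters.

```python
def build_plan(source, app, clusters,
               remote_dir="/tmp/grid_job", results="results.txt"):
    """Return the list of commands for the four steps of the procedure.

    clusters: list of dicts with the hypothetical keys
      'login'    - management host of the cluster,
      'compiler' - compiler to use on that cluster,
      'nodes'    - names of the nodes that will run instances.
    """
    plan, contexts = [], []
    for c in clusters:
        login = c["login"]
        # Step 1: transfer the source file to the remote cluster.
        plan.append(["ssh", login, f"mkdir -p {remote_dir}"])
        plan.append(["scp", source, f"{login}:{remote_dir}/"])
        # Step 2: compile it there with the cluster-specific compiler.
        plan.append(["ssh", login,
                     f"cd {remote_dir} && {c['compiler']} -o {app} {source}"])
        contexts.append(["-np", str(len(c["nodes"])),
                         "-host", ",".join(c["nodes"]),
                         f"{remote_dir}/{app}"])
    # Step 3: start the application on the chosen nodes of every cluster.
    launch = ["mpirun"]
    for i, ctx in enumerate(contexts):
        if i:
            launch.append(":")
        launch.extend(ctx)
    plan.append(launch)
    # Step 4: centralize the results (assumed to be written by the
    # application to a file on the first cluster).
    plan.append(["scp", f"{clusters[0]['login']}:{remote_dir}/{results}", "."])
    return plan

example = build_plan("matmul.c", "matmul", [
    {"login": "xeon-mgmt", "compiler": "mpicc", "nodes": ["x1", "x2"]},
    {"login": "cell-mgmt", "compiler": "mpicc", "nodes": ["c1"]},
])
for command in example:
    print(" ".join(command))
```

Keeping the plan as data makes it easy to inspect before anything is run; a wrapper could later pass each command to the operating system once the assumptions have been checked against the real installation.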
If the source code uses only functions from the MPI library,
the resulting application can run on both general-purpose
nodes and acceleration nodes, without the need to write
special code for each type of unit. In this case, however,
the synergistic units would not be used for acceleration, so
the system would run below its full capacity (the Cell BE
processors would work in "brute" mode). The available
computing power should nevertheless be exploited to solve
various problems.
 
IV. EXPERIMENTAL RESULTS 
 
An implementation of such a script for interconnecting the
two clusters shows that the interconnection is feasible.
Only 4 Cell BE processors were used on the USV-Roadrunner,
while 28 were used on the Xeon-based cluster. The test
program was an application that calculates the product of
two matrices.
The first system, which benefits from InfiniBand connections,
contains several components that lengthen the time needed to
distribute a small application, and it becomes efficient only
as the amount of data increases. Its features, such as
high-frequency operation, DMA accesses and a pipelined
structure, improve system performance beyond the point where
the distribution overhead becomes small relative to the
volume of data. A graph showing the variation of execution
time with the size of the problem on the different
architectures is shown in Fig. 5.
The USV-Roadrunner, whose Cell BE processors are better
suited to this kind of operation, performs better as the
volume of processed data increases. For a small number of
elements, however, its execution time is greater than
that obtained on the Xeon-based cluster.
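The behaviour described above, where one system is slower for small inputs and faster for large ones, is what a simple cost model with a fixed start-up overhead predicts. The Python sketch below illustrates that reasoning only. The overhead and throughput values are hypothetical and are not the measurements behind Fig. 5.

```python
def matmul_flops(n):
    """Floating-point operations for a dense n x n matrix product."""
    return 2.0 * n ** 3

def run_time(n, overhead_s, rate_flops):
    """Toy model: fixed distribution overhead plus compute time."""
    return overhead_s + matmul_flops(n) / rate_flops

# Hypothetical parameters, chosen only to show a crossover.
XEON = {"overhead_s": 0.1, "rate_flops": 50e9}    # low overhead, lower rate
CELL = {"overhead_s": 1.0, "rate_flops": 200e9}   # high overhead, higher rate

def faster_system(n):
    """Name of the system with the smaller modelled run time."""
    t_xeon = run_time(n, **XEON)
    t_cell = run_time(n, **CELL)
    return "cell" if t_cell < t_xeon else "xeon"

for n in (500, 2000, 8000):
    print(n, faster_system(n))
```

With these assumed values the general-purpose system wins for small matrices, because the fixed overhead dominates, and the accelerated system wins once the computation itself dominates.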
A more advanced version of this application would
decompose the problem according to the different types of
tasks, which would then be sent to the suitable cores of the
cluster.
This method requires additional programming to create
three types of programs: for the SPU units, for the PPU
units, and for the x86 platforms. In addition, a script is
used to compile the application simultaneously on all
machines and then run it on the specified nodes.
The management nodes will not usually run distributed
applications (at the MPI level), to avoid being congested
with additional tasks.
 
V. CONCLUSIONS 
 
The USV-Roadrunner hybrid cluster is a powerful system
for HPC tasks, which benefits from recent technology.
The scalability of this type of system was demonstrated on a
much larger system owned by the Los Alamos laboratories,
where scaling was shown to be almost linear even with over
10,000 processors [3]. This makes it possible to extend the
number of units in a simple manner and without
infrastructure changes, by adding QS22 and LS22 blades in
the proportion of 1:6.
 
 
Fig. 5. Variation of computing performance as a function of the volume of
data processed (compared architectures). 
 
Also, this type of system is very efficient in terms of
energy consumption, surpassing other similar platforms [8].
The energy efficiency of this hybrid system is
437 MFlops/watt, compared to equivalent systems
whose efficiency usually does not exceed 378 MFlops/watt.
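To put the two efficiency figures in proportion, the short Python calculation below compares them directly. It uses only the two values quoted above; the function names are ours, and no power measurement is implied.

```python
HYBRID_MFLOPS_PER_WATT = 437.0      # figure quoted for the hybrid system
REFERENCE_MFLOPS_PER_WATT = 378.0   # upper figure quoted for equivalent systems

def relative_gain(efficiency, reference):
    """Fractional improvement of one efficiency figure over another."""
    return efficiency / reference - 1.0

def watts_needed(mflops, mflops_per_watt):
    """Power needed to sustain a given performance at a given efficiency."""
    return mflops / mflops_per_watt

gain = relative_gain(HYBRID_MFLOPS_PER_WATT, REFERENCE_MFLOPS_PER_WATT)
print(f"{gain:.1%}")
```

The hybrid figure is roughly 16% higher, which means that for the same sustained performance the reference systems would draw about 16% more power.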
This paper shows that an application can use the
synergistic processors in a general manner (as simple CPUs,
in "brute" calculating mode, without resorting to
acceleration libraries) and can be distributed as in a grid
system [11], [12], but that it will not achieve the best
performance in this way: the accelerators bring no
advantage.
If the aim is to obtain a significant acceleration of the
calculations, in applications such as those in [13], [14],
it is necessary to provide code for both architectures
(platforms), the general-purpose one and the hybrid one. The
disadvantage of this practice is that it practically
requires writing two applications (which increases
development time) and compiling them separately for each
architecture.
A possible solution is to use a programming method that
divides the problem into sub-problems according to
their nature, and then sends them to the processor
specialized in solving that type of problem [9], [10].
For example, everything involving work with files,
formatting and queries, i.e. everything not related to brute
calculation, can be delegated to the classical architecture,
while vector or matrix calculations are sent to the
accelerators. The generic processors decompose the matrix
into vectors that are sent to the accelerator units for
processing, and then assemble the results at the end,
maximizing the benefit of running a matrix problem on
such a hybrid architecture.
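The decompose, process and assemble pattern described above can be illustrated in a few lines of Python. In this sketch a pool of threads merely stands in for the accelerator units, and the matrix is split into row vectors by the "generic" side; nothing here uses the Cell BE libraries, and the function names and the default of eight workers (one per SPE) are illustrative assumptions.

```python
from concurrent.futures import ThreadPoolExecutor

def row_times_matrix(row, b):
    """Work unit for one 'accelerator': a row vector times matrix b."""
    n_cols = len(b[0])
    return [sum(row[k] * b[k][j] for k in range(len(b)))
            for j in range(n_cols)]

def hybrid_matmul(a, b, workers=8):
    """Decompose a into rows, process them in parallel, reassemble.

    The thread pool is only a stand-in for the accelerator units;
    map() returns the rows in their original order, so the result
    is assembled simply by collecting them into a list.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda row: row_times_matrix(row, b), a))

product = hybrid_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(product)
```

The same structure applies when the workers are real accelerators: only the body of the work unit and the transport of the rows would change.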
The specialized nature of the Cell BE processors gives this
type of interconnection significant advantages for certain
mathematical problems (on which nodes built on
general-purpose architectures do not reach the same
performance). For problems that do not use such intensive
mathematical routines, however, the accelerator nodes may
even be limiting in certain respects (for example, the lack
of a large local memory).
ACKNOWLEDGMENTS 
 
The research was carried out in the HPC Lab. of the 
University Stefan cel Mare of Suceava. The main equipment, 
the hybrid cluster USV-Roadrunner, was purchased through a 
financial grant obtained in the Romanian national 
competition PN II, Capacities program, Module I, by the 
project entitled "Grid for developing pattern recognition and 
artificial intelligence distributed applications - GRIDNORD", 
80/13.09.2007.  
 
REFERENCES 
 
[1]  T. Chen, R. Raghavan, J. Dale, E. Iwata, “Cell Broadband 
Engine Architecture” , J. of Research and Develop. IBM , vol. 
51, no. 5, 2007. 
[2]  ****, “Cell Broadband Engine – Programming Handbook
v1.1”, IBM Systems and Technology Group,
https://www-01.ibm.com/chips/techlib/techlib.nsf/techdocs/7A77CCDF14FE70D5852575CA0074E8ED.
[3]  K. Koch, “Roadrunner Platform Overview”, Roadrunner 
Technical Seminar Series, Los Alamos National Laboratory, 
March 2008. 
[4]  ****, “SPU Application Binary Interface Specification v1.9”, 
CBEA JSRE Series, Cell Broadband Engine Architecture, Joint 
Software Reference Environment Series, July 2008. 
[5]  C. Kessler, “Programming Techniques for the CELL 
Processor”, Kista, Sweden, Multicore Day 2009. 
[6]  J. A. Turner, “The Los Alamos Roadrunner Petascale Hybrid 
Supercomputer – Overview of Applications, Results and 
Programming”, Roadrunner Technical Seminar Series, Los 
Alamos National Laboratory, March 2008. 
[7]  B. M. Bode, J.J. Hill, T. R. Benjegerdes, “Cluster Interconnect 
Overview”, Proceedings of the annual conference on USENIX 
Annual Technical Conference, Boston, 2004 
[8]  I. Ungurean, S. G. Pentiuc, V. Gaitan. “Performance 
Evaluation of an Experimental Grid Computer using MPI 
Applications”, Electronics and Electrical Engineering. Kaunas, 
2009. –No. 5(93). – pp. 55–58. 
[9]  V. Pilkauskas, R. Plestys, G. Vilutis, D. Sandonavicius. 
“Improvement of WMS Functionality, Aiming to Minimize 
Processing Time of Jobs in Grid Computing”, Electronics and 
Electrical Engineering.Kaunas, No. 7(113). pp. 111–116, 2011. 
[10] K. Sutiene, G. Vilutis, D. Sandonavicius. “Forecasting of 
GRID Job Waiting Time from Imputed Time Series”, 
Electronics and Electrical Engineering. Kaunas,  No. 8(114). 
pp. 101–106, 2011. 
[11] O. Brudaru, D. Popovici, C. Copaceanu, "Cellular Genetic 
Algorithm with Communicating Grids for Assembly Line 
Balancing Problems," Advances in Electrical and Computer 
Engineering, vol. 10, no. 2, pp. 87-93, 2010. 
[12] C. Aflori, M. Craus "Grid Implementation of the Apriori 
Algorithm", Advances in Engineering Software, Elsevier Ltd., 
pp. 295-300 Volume 38, Issue 5, 2007. 
[13] V. Lupu, D. E. Tiliute, “The Communication in Distributed 
Client - Server Systems Used for Management of Flexible 
Manufacturing Systems”, Int. J. of Computers, 
Communications & Control, Vol. VI (2011), No. 2 (June), pp. 
297-304.  
[14] R.D. Vatavu, “Understanding Challenges in Designing 
Interactions for the Age of Ambient Media”, The 3rd Workshop 
on Semantic Ambient Media Experience (SAME) at AmI 2010, 
pp. 8-13, Malaga, Nov. 2010 
 
 
Gherman Ovidiu graduated from “Stefan cel Mare” University of Suceava in 2004 and now he is a teaching assistant at the same 
university, Faculty of Electrical Engineering and Computer Science. His research interests include Parallel and Distributed Systems. 
 
Ungurean Ioan received the PhD degree in Computer Science at “Stefan cel Mare” University of Suceava in 2011. He is currently a 
teaching assistant in the Department of Computer, Electronics and Automation at “Stefan cel Mare” University of Suceava. In 2009, he was 
involved in the establishment of a High Performance Computing laboratory at his university. 
 
Ştefan Gheorghe Pentiuc is professor at the “Stefan cel Mare” University of Suceava, Faculty of Electrical Engineering and Computer 
Science. His research interests include Pattern Recognition, Parallel and Distributed Systems. 