7 research outputs found

    MPICH-G2: A Grid-Enabled Implementation of the Message Passing Interface

    Application development for distributed computing "Grids" can benefit from tools that variously hide or enable application-level management of critical aspects of the heterogeneous environment. As part of an investigation of these issues, we have developed MPICH-G2, a Grid-enabled implementation of the Message Passing Interface (MPI) that allows a user to run MPI programs across multiple computers, at the same or different sites, using the same commands that would be used on a parallel computer. This library extends the Argonne MPICH implementation of MPI to use services provided by the Globus Toolkit for authentication, authorization, resource allocation, executable staging, and I/O, as well as for process creation, monitoring, and control. Various performance-critical operations, including startup and collective operations, are configured to exploit network topology information. The library also exploits MPI constructs for performance management; for example, the MPI communicator construct is used for application-level discovery of, and adaptation to, both network topology and network quality-of-service mechanisms. We describe the MPICH-G2 design and implementation, present performance results, and review application experiences, including record-setting distributed simulations. (Comment: 20 pages, 8 figures)
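    As a concrete illustration of the communicator-based topology adaptation described above, the sketch below (in C) groups MPI processes by site and performs a two-level reduction: within each site first, then across site leaders. It is a minimal sketch, not MPICH-G2's actual interface; the SITE_ID environment variable and the leader communicator are assumptions standing in for the topology information MPICH-G2 exposes through the communicator construct.

    /* Minimal sketch (not MPICH-G2's API): group ranks by "site" so the
       application can do site-local collectives over fast links before a
       small inter-site step over the WAN. SITE_ID is a hypothetical
       stand-in for runtime-provided topology information. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int world_rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

        /* Hypothetical site identifier; defaults to 0 if not provided. */
        const char *s = getenv("SITE_ID");
        int site = s ? atoi(s) : 0;

        /* All ranks at the same site share one subcommunicator. */
        MPI_Comm site_comm;
        MPI_Comm_split(MPI_COMM_WORLD, site, world_rank, &site_comm);

        int site_rank;
        MPI_Comm_rank(site_comm, &site_rank);

        /* Reduce within each site first (fast links)... */
        double local = 1.0, site_sum = 0.0;
        MPI_Reduce(&local, &site_sum, 1, MPI_DOUBLE, MPI_SUM, 0, site_comm);

        /* ...then let the site leaders combine the partial sums over the WAN. */
        MPI_Comm leaders;
        MPI_Comm_split(MPI_COMM_WORLD, site_rank == 0 ? 0 : MPI_UNDEFINED,
                       world_rank, &leaders);
        if (leaders != MPI_COMM_NULL) {
            double total = 0.0;
            MPI_Reduce(&site_sum, &total, 1, MPI_DOUBLE, MPI_SUM, 0, leaders);
            if (world_rank == 0)
                printf("global sum = %f\n", total);
            MPI_Comm_free(&leaders);
        }

        MPI_Comm_free(&site_comm);
        MPI_Finalize();
        return 0;
    }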

    Dynamic Load Balancing of SAMR Applications on Distributed Systems


    Technologies and tools for high-performance distributed computing. Final report


    VCluster: A Portable Virtual Computing Library for Cluster Computing

    Message passing has been the dominant parallel programming model in cluster computing, and libraries like the Message Passing Interface (MPI) and Parallel Virtual Machine (PVM) have proven their utility and efficiency in numerous applications across diverse areas. However, as clusters of Symmetric Multi-Processor (SMP) and heterogeneous machines become popular, conventional message passing models must be adapted to support this new kind of cluster efficiently. In addition, the Java programming language, with its object-oriented architecture, platform-independent bytecode, and native support for multithreading, is an attractive alternative language for cluster computing. This research presents a new parallel programming model and a library called VCluster that implements this model on top of a Java Virtual Machine (JVM). The programming model is based on virtual migrating threads to support clusters of heterogeneous SMP machines efficiently. VCluster is implemented in 100% Java, using the portability of Java to address the problems of heterogeneous machines. VCluster virtualizes computational and communication resources such as threads, computation states, and communication channels across multiple separate JVMs, which makes mobile threads possible. Equipped with virtual migrating threads, it is feasible to balance the load of computing resources dynamically. Several large-scale parallel applications have been developed using VCluster to compare its performance and usability with other libraries. The experiments show that VCluster makes it easier to develop multithreaded parallel applications than conventional libraries like MPI, while its performance is comparable to MPICH, a widely used MPI library, combined with popular threading libraries like POSIX Threads and OpenMP. In the next phase of our work, we implemented thread groups and thread migration to demonstrate the feasibility of dynamic load balancing in VCluster. Our experiments show that the load can be dynamically balanced in VCluster, resulting in better performance. Thread groups also make it possible to implement collective communication functions between threads, which have proven useful in process-based libraries.
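    For contrast with VCluster's migrating-thread model, the sketch below shows the conventional hybrid baseline the abstract compares against: MPI for message passing between processes and OpenMP threads for shared memory inside each SMP node. It is an assumed C example of that baseline, not VCluster's API (VCluster itself is a Java library).

    /* Conventional hybrid baseline: MPI between processes, OpenMP threads
       inside each SMP node. Each process sums part of a range with its
       local threads, then the per-process results are combined with MPI. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        const long n = 1000000;
        long chunk = n / size;
        long begin = rank * chunk;
        long end = (rank == size - 1) ? n : begin + chunk;

        double local = 0.0;
    #pragma omp parallel for reduction(+:local)
        for (long i = begin; i < end; ++i)
            local += 1.0 / (double)(i + 1);

        double global = 0.0;
        MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("partial harmonic sum H(%ld) = %f\n", n, global);

        MPI_Finalize();
        return 0;
    }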

    Domain Decomposition Strategies for Grid Environments

    In this work we are interested in running finite element numerical simulations with explicit time integration using Grid technology. Explicit finite element simulations currently distribute their data using domain decomposition with balanced partitions. However, this data distribution degrades the performance of explicit simulations considerably when they are run in Grid environments. The main reason is that communication in a Grid is heterogeneous: very fast within a machine and very slow between machines. As a result, a balanced data distribution runs at the speed of the slowest communication. To overcome this problem we propose overlapping remote communication time with computation time. To do so, some processors are dedicated to managing the slowest communication while the rest perform intensive computation. This distribution scheme requires an unbalanced domain decomposition, so that the processors dedicated to managing the slow communication carry almost no computational load. In this work we propose and analyze several strategies for distributing the data and improving application performance in Grid environments. The static distribution strategies analyzed are:
    1. U-1domains: The data domain is first divided among the machines in proportion to their relative speed. Within each machine, the data are then divided into nprocs-1 parts, where nprocs is the total number of processors in the machine. Each subdomain is assigned to one processor, and each machine has a single processor dedicated to managing remote communication with other machines.
    2. U-Bdomains: The data are partitioned in two phases. The first phase is the same as for the U-1domains distribution. The second phase divides each data subdomain proportionally into nprocs-B parts, where B is the number of remote communications with other machines (special domains). Each machine has more than one processor to manage remote communication.
    3. U-CBdomains: As many special domains are created as there are remote communications, but the special domains are now assigned to a single processor within the machine. Each data subdomain is therefore divided into nprocs-1 parts, and the remote communications are managed concurrently by threads.
    We use Dimemas to evaluate application performance in Grid environments. For each case, we evaluate the applications in different environments and with different mesh types. The results show that:
    · The U-1domains distribution reduces execution times by up to 45% with respect to the balanced distribution. However, it is not effective for Grid environments composed of a large number of remote machines.
    · The U-Bdomains distribution proves more efficient, reducing execution time by up to 53%. Its scalability is moderate, however, because it can end up with a large number of processors that do no intensive computation and only manage remote communication. As a limit, this distribution can only be applied if more than 50% of the processors in a machine perform computation.
    · The U-CBdomains distribution reduces execution times by up to 30%, which is not as effective as the U-Bdomains distribution, but it increases processor utilization by 50%, i.e. it reduces the number of idle processors.
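    A minimal sketch of the overlap idea behind these distributions, using nonblocking MPI: post the slow inter-machine boundary exchange first, compute on interior data while the messages are in flight, and only wait when the boundary values are needed. The thesis's actual scheme dedicates whole processors to the slow links; the neighbour ranks, array sizes, and helper functions below are illustrative assumptions.

    /* Overlap remote communication with computation via nonblocking MPI.
       compute_interior/compute_boundary are placeholders for the real
       explicit finite element kernels. */
    #include <mpi.h>
    #include <stdlib.h>

    #define N 1024   /* local subdomain size (illustrative) */

    static void compute_interior(double *u) { (void)u; /* heavy local work */ }
    static void compute_boundary(double *u, double l, double r) { (void)u; (void)l; (void)r; }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        double *u = calloc(N, sizeof *u);
        double halo_left = 0.0, halo_right = 0.0;
        int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
        int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

        /* Post the (possibly slow, remote) boundary exchange first... */
        MPI_Request req[4];
        MPI_Irecv(&halo_left,  1, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &req[0]);
        MPI_Irecv(&halo_right, 1, MPI_DOUBLE, right, 1, MPI_COMM_WORLD, &req[1]);
        MPI_Isend(&u[0],       1, MPI_DOUBLE, left,  1, MPI_COMM_WORLD, &req[2]);
        MPI_Isend(&u[N - 1],   1, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &req[3]);

        /* ...do the bulk of the computation while messages are in flight... */
        compute_interior(u);

        /* ...and only block on the slow links when the boundary is needed. */
        MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
        compute_boundary(u, halo_left, halo_right);

        free(u);
        MPI_Finalize();
        return 0;
    }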

    Mesh Partitioning for Distributed Systems

    Distributed systems, which consist of a collection of high performance systems interconnected via high performance networks (e.g. ATM), are becoming feasible platforms for execution of large-scale, complex problems. In this paper, we address various issues related to mesh partitioning for distributed systems. These issues include the metric used to compare different partitions, efficiency of the application executing on a distributed system, the number of cut sets, and the advantage of exploiting heterogeneity in network performance. We present a tool called PART, for automatic mesh partitioning for distributed systems. The novel feature of PART is that it considers heterogeneities in the application and the distributed system. The heterogeneities in the distributed system include processor and network performance; the heterogeneities in the application include computational complexities. Preliminary results are presented for partitioning regular and irregular finite element meshes for..
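    The sketch below illustrates the heterogeneity-aware sizing step that a partitioner such as PART has to perform: subdomain sizes proportional to relative processor speed rather than an even split. The speeds and element count are made-up inputs, and PART itself also weighs network performance and per-element computational cost, which this toy calculation does not.

    /* Assign mesh elements to processors in proportion to relative speed. */
    #include <stdio.h>

    int main(void)
    {
        const double speed[] = { 1.0, 1.0, 2.0, 4.0 };  /* relative speeds (illustrative) */
        const int nprocs = sizeof speed / sizeof speed[0];
        const int nelems = 100000;                       /* mesh elements to place */

        double total = 0.0;
        for (int p = 0; p < nprocs; ++p)
            total += speed[p];

        int assigned = 0;
        for (int p = 0; p < nprocs; ++p) {
            /* Give the last processor the remainder so every element is placed. */
            int share = (p == nprocs - 1)
                          ? nelems - assigned
                          : (int)(nelems * speed[p] / total);
            assigned += share;
            printf("processor %d: %d elements\n", p, share);
        }
        return 0;
    }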

    Mesh Partitioning for Distributed Systems: Exploring Optimal Number of Partitions with Local and Remote Communication

    Mesh partitioning for distributed systems differs from partitioning for homogeneous systems in that both system and application heterogeneities need to be taken into consideration. In this paper, we focus on the issue of the optimal number of partitions with local and remote communication. This issue is important because local and remote communication times generally differ by orders of magnitude in latency. Our theoretical analyses indicate that when this difference is large, it is advantageous to decrease the number of messages sent remotely, thereby decreasing latency. Further, as the number of processors requiring remote communication decreases, more processors can be used to retrofit the partition, which aids in reducing execution time. We use an explicit finite element application and four irregular meshes to experiment on two geographically distributed IBM SP machines. Our results using our tool, PART, show that by forcing remote communication on one processor, the t..
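    A toy latency model of the trade-off studied here, under two stated assumptions: remote messages leaving one machine effectively serialize on the wide-area link, and latency dominates for the message sizes involved. It compares every boundary processor sending its own remote message against funnelling the data through one processor that sends a single combined message. The latency values are illustrative, not measurements from the paper.

    /* Compare k direct remote messages vs. a local gather plus one remote message. */
    #include <stdio.h>

    int main(void)
    {
        const double local_latency  = 50e-6;   /* ~50 us within a machine (assumed) */
        const double remote_latency = 20e-3;   /* ~20 ms between sites (assumed)    */

        for (int k = 1; k <= 16; k *= 2) {
            double direct   = k * remote_latency;                 /* each processor sends remotely */
            double funneled = k * local_latency + remote_latency; /* gather locally, send once     */
            printf("k=%2d  direct=%7.3f ms  funneled=%7.3f ms\n",
                   k, direct * 1e3, funneled * 1e3);
        }
        return 0;
    }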