21 research outputs found

    Exploring the predictability of MPI messages

    Get PDF
    Scalability to a large number of processes is one of the weaknesses of current MPI implementations. Standard implementations are able to scale to hundreds of nodes, but none beyond that. The main problem of current implementations is that performance is more important than scalability and thus some assumptions about resources are taken that will not scale well. The objective of the paper is twofold. On one hand, we show that characteristics such as the size and the sender of MPI messages are very predictable (accuracy above 90%). On the other hand, we present some examples where current MPI implementations would not work well when run on a large configuration and how this predictability could be used to solve the scalability problem.This work has been supported by the Ministry of Science and Technology of Spain and the European Union (FEDER funds) under contract TIC2001-0995-C02-01.Peer ReviewedPostprint (author's final draft

    Implementation of MPICH on top of MPLi̲te

    Get PDF
    The goal of this thesis is to develop a new Channel Interface device for the MPICH implementation of the MPI (Message Passing Interface) standard using MPLi̲te. MPLi̲te is a lightweight message-passing library that is not a full MPI implementation, but offers high performance. MPICH (Message Passing Interface CHameleon) is a full implementation of the MPI standard that has the p4 library as the underlying communication device for TCP/IP networks. By integrating MPLi̲te as a Channel Interface device in MPICH, a parallel programmer can utilize the full MPI implementation of MPICH as well as the high bandwidth offered by MPLi̲te. There are several layers in the MPICH library where one can tie a new device. The Channel Interface is the lowest layer that requires very few functions to add a new device. By attaching MPLi̲te to MPICH at the lowest level, the Channel Interface, almost all of the performance of the MPLi̲te library can be delivered to the applications using MPICH. MPLi̲te can be implemented either as a blocking or a non-blocking Channel Interface device. The performance was measured on two separate test clusters, the PC and the Alpha mini-clusters, having Gigabit Ethernet connections. The PC cluster has two 1.8 GHz Pentium 4 PCs and the Alpha cluster has two 500 MHz Compaq DS20 workstations. Different network interface cards like Netgear, TrendNet and SysKonnect Gigabit Ethernet cards were used for the measurements. Both the blocking and non-blocking MPICH-MPLi̲te Channel Interface devices perform close to raw TCP, whereas a performance loss of 25-30% is seen in the MPICH-p4 Channel Interface device for larger messages. The superior performance offered by the MPICH-MPLi̲te device compared to the MPICH-p4 device can be easily seen on the SysKonnect cards using jumbo frames. The throughput curve also improves considerably by increasing the Eager/Rendezvous threshold

    Implementation of MPICH on Top of MP_Lite

    Full text link

    Optimizing 10-Gigabit Ethernet for Networks of Workstations, Clusters, and Grids: A Case Study

    Full text link

    Run-time Support for Distributed Object Sharing in Safe Programming Languages

    Get PDF
    We present a new run-time system that supports object sharing in a distributed system. The key insight in this system is that a handle-based implementation of such a system enables effcient and transparent sharing of data with both fine-grained and coarse-grained access patterns. In addition, it supports effcient execution of garbage-collected programs. In contrast, conventional distributed shared memory (DSM) systems are limited to providing only one granularity with good performance, and have experienced diffculty in effciently supporting garbage collection. A safe language, in which no pointer arithmetic is allowed, can transparently be compiled into a handle-based system and constitutes its preferred mode of use. A programmer can also directly use a handle-based programming model that avoids pointer arithmetic on the handles, and achieve the same performance but without the programming benefits of a safe programming language. This new run-time system, DOSA (Distributed Object Sharing Architecture), provides a shared object space abstraction rather than a shared address space abstraction. The key to its effciency is the observation that a handle-based distributed implementation permits VM-based access and modification detection without suffering false sharing for fine-grained access patterns. We compare DOSA to TreadMarks, a conventional DSM system that is effcient at handling coarse-grained sharing. The performance of fine-grained applications and garbage-collected applications is considerably better than in TreadMarks. The performance of coarse-grained applications is nearly as good as in TreadMarks. Since the performance of such applications is already good in TreadMarks, we consider this an acceptable performance penalty

    Routing on the Channel Dependency Graph:: A New Approach to Deadlock-Free, Destination-Based, High-Performance Routing for Lossless Interconnection Networks

    Get PDF
    In the pursuit for ever-increasing compute power, and with Moore's law slowly coming to an end, high-performance computing started to scale-out to larger systems. Alongside the increasing system size, the interconnection network is growing to accommodate and connect tens of thousands of compute nodes. These networks have a large influence on total cost, application performance, energy consumption, and overall system efficiency of the supercomputer. Unfortunately, state-of-the-art routing algorithms, which define the packet paths through the network, do not utilize this important resource efficiently. Topology-aware routing algorithms become increasingly inapplicable, due to irregular topologies, which either are irregular by design, or most often a result of hardware failures. Exchanging faulty network components potentially requires whole system downtime further increasing the cost of the failure. This management approach becomes more and more impractical due to the scale of today's networks and the accompanying steady decrease of the mean time between failures. Alternative methods of operating and maintaining these high-performance interconnects, both in terms of hardware- and software-management, are necessary to mitigate negative effects experienced by scientific applications executed on the supercomputer. However, existing topology-agnostic routing algorithms either suffer from poor load balancing or are not bounded in the number of virtual channels needed to resolve deadlocks in the routing tables. Using the fail-in-place strategy, a well-established method for storage systems to repair only critical component failures, is a feasible solution for current and future HPC interconnects as well as other large-scale installations such as data center networks. Although, an appropriate combination of topology and routing algorithm is required to minimize the throughput degradation for the entire system. This thesis contributes a network simulation toolchain to facilitate the process of finding a suitable combination, either during system design or while it is in operation. On top of this foundation, a key contribution is a novel scheduling-aware routing, which reduces fault-induced throughput degradation while improving overall network utilization. The scheduling-aware routing performs frequent property preserving routing updates to optimize the path balancing for simultaneously running batch jobs. The increased deployment of lossless interconnection networks, in conjunction with fail-in-place modes of operation and topology-agnostic, scheduling-aware routing algorithms, necessitates new solutions to solve the routing-deadlock problem. Therefore, this thesis further advances the state-of-the-art by introducing a novel concept of routing on the channel dependency graph, which allows the design of an universally applicable destination-based routing capable of optimizing the path balancing without exceeding a given number of virtual channels, which are a common hardware limitation. This disruptive innovation enables implicit deadlock-avoidance during path calculation, instead of solving both problems separately as all previous solutions

    Weapon Release Scheduling from Multiple-Bay Aircraft using Multi-Objective Evolutionary Algorithms

    Get PDF
    The United States Air Force has put an increased emphasis on the timely delivery of precision weapons. Part of this effort has been to us multiple bay aircraft such the B-1B Lancer and B-52 Stratofortress to provide Close Air Support and responsive strikes using 1760 weapons. In order to provide greater flexibility, the aircraft carry heterogeneous payloads which can require deconfliction in order to drop multiple different types of weapons. Current methods of deconfliction and weapon selection are highly crew dependent and work intensive. This research effort investigates the optimization of an algorithm for weapon release which allows the aircraft to perform deconfliction automatically. This reduces crew load and response time in order to deal with time-sensitive targets. The overall problem maps to the Job-Shop Scheduling problem. Optimization of the algorithm is done through the General Multiobjective Parallel Genetic Algorithm (GENMOP). We examine the results from pedagogical experiments and real-world test scenarios in the light of improving decision making. The results are encouraging in that the program proves capable of finding acceptable release schedules, however the solution space is such that applying the program to real world situations is unnecessary. We present visualizations of the schedules which demonstrate these conclusions

    Architectural support for enhancing security in clusters

    Get PDF
    Cluster computing has emerged as a common approach for providing more comput- ing and data resources in industry as well as in academia. However, since cluster computer developers have paid more attention to performance and cost e±ciency than to security, numerous security loopholes in cluster servers come to the forefront. Clusters usually rely on ¯rewalls for their security, but the ¯rewalls cannot prevent all security attacks; therefore, cluster systems should be designed to be robust to security attacks intrinsically. In this research, we propose architectural supports for enhancing security of clus- ter systems with marginal performance overhead. This research proceeds in a bottom- up fashion starting from enforcing each cluster component's security to building an integrated secure cluster. First, we propose secure cluster interconnects providing con- ¯dentiality, authentication, and availability. Second, a security accelerating network interface card architecture is proposed to enable low performance overhead encryption and authentication. Third, to enhance security in an individual cluster node, we pro- pose a secure design for shared-memory multiprocessors (SMP) architecture, which is deployed in many clusters. The secure SMP architecture will provide con¯dential communication between processors. This will remove the vulnerability of eavesdrop- ping attacks in a cluster node. Finally, to put all proposed schemes together, we propose a security/performance trade-o® model which can precisely predict performance of an integrated secure cluster
    corecore