106 research outputs found

    Performance evaluation of Fast Ethernet, ATM and Myrinet under PVM

    Congestion in network switches can limit the communication traffic between Parallel Virtual Machine (PVM) nodes in a parallel computation. The research introduces a new benchmark to evaluate the performance of PVM in various networking environments. The benchmark is used to achieve a better understanding of performance limitations in parallel computing that are imposed by the choice of the network. The networks considered here are Fast Ethernet, Asynchronous Transfer Mode (ATM) OC-3c (155 Mb/s) and Myrinet. Together, they represent an interesting range of alternatives for parallel cluster computing. A characterization of network delays and throughput and a comparison of the expected costs of the three environments are developed to provide a basis for an informed decision on the networking methods and topology for a parallel database that is being considered for the FBI's National DNA Indexing System (NDIS) [17]. This network is used for communications among the nodes of the parallel machine; thus the security requirements defined for the FBI's Criminal Justice Information Services Division Wide Area Network (CJIS-WAN) [12] are not a concern.

    Predictive models for bandwidth sharing in high performance clusters

    When MPI is used as the communication interface, one or several applications may introduce complex communication behaviors over the cluster network. This effect is amplified when the nodes of the cluster are multiprocessors, where communications can enter or leave the same node within the same time interval. Our goal is to understand those behaviors in order to build a class of predictive models of bandwidth sharing, knowing on the one hand the flow control mechanisms and on the other hand a set of experimental results. This paper presents experiments that show how bandwidth is shared on Gigabit Ethernet, Myrinet 2000 and InfiniBand networks, before introducing the models for Gigabit Ethernet and Myrinet 2000 networks.

    Modeling Network Contention Effects on All-to-All Operations

    10 pages. One of the most important collective communication patterns used in scientific applications is the complete exchange, also called All-to-All. Although efficient complete exchange algorithms have been studied for specific networks, general solutions like those available in well-known MPI distributions (e.g. the MPI_Alltoall operation) are strongly influenced by the congestion of network resources. In this paper we present an integrated approach to model the performance of the All-to-All collective operation. Our approach consists in identifying a contention signature that characterizes a given network environment, using it to augment a contention-free communication model. This approach allows an accurate prediction of the performance of the All-to-All operation over different network architectures with a small overhead. This approach is assessed by experimental results using three different network architectures, namely Fast Ethernet, Gigabit Ethernet and Myrinet.
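
The idea of augmenting a contention-free model with a network-specific contention signature can be illustrated with a minimal sketch. The abstract does not give the authors' formula, so everything here is an assumption: the contention-free cost uses the Hockney point-to-point model (latency `alpha`, per-byte cost `beta`), and the signature is reduced to one hypothetical multiplicative ratio `gamma` fitted to measured times.

```python
def alltoall_contention_free(n, m, alpha, beta):
    """Contention-free estimate: each of n processes sends n-1 messages
    of m bytes, each costing alpha + m*beta (Hockney model, assumed)."""
    return (n - 1) * (alpha + m * beta)


def fit_contention_ratio(samples, alpha, beta):
    """Least-squares fit of gamma in T_measured ~= gamma * T_contention_free.

    samples is a list of (n, m, measured_time) tuples.
    gamma is a hypothetical one-parameter contention signature.
    """
    num = 0.0
    den = 0.0
    for n, m, measured in samples:
        base = alltoall_contention_free(n, m, alpha, beta)
        num += measured * base
        den += base * base
    return num / den


def predict_alltoall(n, m, alpha, beta, gamma):
    """Predicted All-to-All time once gamma is known for a network."""
    return gamma * alltoall_contention_free(n, m, alpha, beta)
```

In this sketch `gamma` would be fitted once per network from a few measurements and then reused to predict other process counts and message sizes; a real signature may need more than one parameter.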

    Modelling Network Contention Effects on All-to-All Operations

    Extended version of the article published at CLUSTER2006. One of the most important collective communication patterns used in scientific applications is the complete exchange, also called All-to-All. Although efficient complete exchange algorithms have been studied for specific networks, general solutions like those available in well-known MPI distributions (e.g. the MPI_Alltoall operation) are strongly influenced by the congestion of network resources. In this paper we present an integrated approach to model the performance of the All-to-All collective operation. Our approach consists in identifying a contention signature that characterizes a given network environment, using it to augment a contention-free communication model. This approach allows an accurate prediction of the performance of the All-to-All operation over different network architectures with a small overhead. This approach is assessed by experimental results using three different network architectures, namely Fast Ethernet, Gigabit Ethernet and Myrinet.

    Assessing Contention Effects on MPI_Alltoall Communications

    12 pages. One of the most important collective communication patterns used in scientific applications is the complete exchange, also called All-to-All. Although efficient algorithms have been studied for specific networks, general solutions like those available in well-known MPI distributions (e.g. the MPI_Alltoall operation) are strongly influenced by the congestion of network resources. In this paper we present an integrated approach to model the performance of the All-to-All collective operation, which consists in identifying a contention signature that characterizes a given network environment, using it to augment a contention-free communication model. This approach, assessed by experimental results, allows an accurate prediction of the performance of the All-to-All operation over different network architectures with a small overhead.

    End-to-End Latency Prediction for General-Topology Cut-Through Switching Networks

    Low-latency networking is gaining attention as a way to support futuristic network applications, such as the Tactile Internet, that have stringent end-to-end latency requirements. To realize this vision, cut-through (CT) switching is believed to be a promising way to significantly reduce the latency of today's store-and-forward switching, by splitting a packet into smaller chunks called flits and forwarding them concurrently through the input and output ports of a switch. Nevertheless, the end-to-end latency performance of CT switching has not been well studied in heterogeneous networks, which hinders its adoption in general-topology networks with heterogeneous links. To fill the gap, this paper proposes an end-to-end latency prediction model for a heterogeneous CT switching network, where the major challenge is that a packet's end-to-end latency depends on how and when its flits are forwarded at each switch, while each flit is forwarded individually. As a result, traditional packet-based queueing models are not directly applicable, and thus we construct a method to estimate per-hop queueing delay via an M/G/c queueing approximation, based on which we predict the end-to-end latency of a packet. Our extensive simulation results show that the proposed model achieves 3.98-6.05% 90th-percentile error in end-to-end latency prediction.
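
To make the per-hop estimate concrete, here is a minimal sketch of one widely used M/G/c approximation, the Allen-Cunneen formula, which scales the M/M/c mean wait by (1 + c_s^2)/2, where c_s^2 is the squared coefficient of variation of the service time. The abstract does not say which approximation the authors use, so this choice and all function and parameter names are assumptions.

```python
from math import factorial


def erlang_c(c, a):
    """Probability that an arrival has to wait in an M/M/c queue.

    c is the number of servers, a = lambda / mu is the offered load;
    the queue is stable only when a < c.
    """
    if a >= c:
        raise ValueError("unstable queue: offered load must be below c")
    rho = a / c
    top = a ** c / factorial(c) / (1.0 - rho)
    bottom = sum(a ** k / factorial(k) for k in range(c)) + top
    return top / bottom


def mgc_mean_wait(lam, mean_service, scv_service, c):
    """Allen-Cunneen approximation of the mean queueing delay in M/G/c.

    lam: arrival rate; mean_service: mean service time;
    scv_service: squared coefficient of variation of the service time.
    """
    a = lam * mean_service
    wq_mmc = erlang_c(c, a) * mean_service / (c - a)  # exact M/M/c wait
    return (1.0 + scv_service) / 2.0 * wq_mmc
```

With `c = 1` the expression reduces to the Pollaczek-Khinchine mean wait, which gives a quick sanity check; how per-hop delays combine into an end-to-end figure is specific to the paper's model and is not attempted here.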

    Optimizing Communication for Massively Parallel Processing

    The current trends in high performance computing show that large machines with tens of thousands of processors will soon be readily available. The IBM Bluegene-L machine with 128k processors (which is currently being deployed) is an important step in this direction. In this scenario, manually scaling applications is going to be a significant burden for the programmer. Scaling involves addressing issues like load imbalance and communication overhead. In this thesis, we explore several communication optimizations that help parallel applications scale easily to a large number of processors. We also present automatic runtime techniques that relieve the programmer of the burden of optimizing communication in applications. This thesis explores processor virtualization to improve communication performance in applications. With processor virtualization, the computation is mapped to virtual processors (VPs). After one VP has finished computing and is waiting for responses to its messages, another VP can compute, thus overlapping communication with computation. This overlap is only effective if the processor overhead of the communication operation is a small fraction of the total communication time. Fortunately, with network interfaces that have co-processors, this happens to be true, and processor virtualization has a natural advantage on such interconnects. The communication optimizations we present in this thesis are motivated by applications such as NAMD (a classical molecular dynamics application) and CPAIMD (a quantum chemistry application). Applications like NAMD and CPAIMD consume a fair share of the time available on supercomputers, so improving their performance would be of great value. We have successfully scaled NAMD to 1 TF of peak performance on 3000 processors of PSC Lemieux, using the techniques presented in this thesis.
    We study both point-to-point communication and collective communication (specifically all-to-all communication). On a large number of processors, all-to-all communication can take several milliseconds to finish. With the synchronous collectives defined in MPI, the processor idles while the collective messages are in flight. Therefore, we demonstrate an asynchronous collective communication framework that lets the CPU compute while the all-to-all messages are in flight. We also show that the best strategy for all-to-all communication depends on the message size, the number of processors and other dynamic parameters. This suggests that these parameters can be observed at runtime and used to choose the optimal strategy for all-to-all communication. In this thesis, we demonstrate adaptive strategy switching for all-to-all communication. The communication optimization framework presented in this thesis has been designed to optimize communication in the context of processor virtualization and dynamically migrating objects. We present the streaming strategy to optimize fine-grained object-to-object communication. We also motivate the need for hardware collectives, as processor-based collectives can be delayed by intermediate processors that are busy with computation. We explore a next-generation interconnect that supports collectives in the switching hardware, and we show the performance gains of hardware collectives through synthetic benchmarks.

    LoGPC: Modeling Network Contention in Message-Passing Programs

    Get PDF
    In many real applications, for example those with frequent and irregular communication patterns or those using large messages, network contention and contention for message processing resources can be a significant part of the total execution time. This paper presents a new cost model, called LoGPC, that extends the LogP [9] and LogGP [4] models to account for the impact of network contention and network interface DMA behavior on the performance of message-passing programs. We validate LoGPC by analyzing three applications implemented with Active Messages [11, 18] on the MIT Alewife multiprocessor. Our analysis shows that network contention accounts for up to 50% of the total execution time. In addition, we show that the impact of communication locality on the communication costs is at most a factor of two on Alewife. Finally, we use the model to identify tradeoffs between synchronous and asynchronous message passing styles.
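
For orientation, the contention-free baselines that LoGPC extends can be written down directly. The sketch below uses the standard LogP and LogGP parameters (latency `L`, per-message overhead `o`, gap `g`, per-byte gap `G`); the function names are illustrative, and the contention and DMA terms that LoGPC adds are not given in the abstract, so they are deliberately left out.

```python
def logp_small_message(L, o):
    """LogP time for one small message: send overhead, wire latency,
    receive overhead."""
    return o + L + o


def logp_message_train(n, L, o, g):
    """LogP time for n back-to-back small messages from one sender:
    successive sends are spaced by max(g, o)."""
    return o + (n - 1) * max(g, o) + L + o


def loggp_long_message(k, L, o, G):
    """LogGP time for one k-byte message: the first byte costs o,
    each further byte adds the per-byte gap G, then latency and
    receive overhead."""
    return o + (k - 1) * G + L + o
```

All times are in whatever unit the parameters use; a contention-aware model would add delay terms on top of these contention-free costs.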

    Designing Efficient Network Interfaces For System Area Networks

    The network is the key component of a Cluster of Workstations/PCs. Its performance, measured in terms of bandwidth and latency, has a great impact on the overall system performance. It quickly became clear that traditional WAN/LAN technology is not well suited for interconnecting powerful nodes into a cluster: its poor performance too often slows down communication-intensive applications. This observation led to the birth of a new class of networks called System Area Networks (SANs). The ATOLL network introduces a new optimized architecture for SANs. On a single chip, not one but four network interfaces (NIs) have been implemented, together with an on-chip 4x4 full-duplex switch and four link interfaces. This unique "Network on a Chip" architecture is best suited for interconnecting SMP nodes, where each of multiple CPUs is given an exclusive NI and does not have to share a single interface. It also removes the need for any additional switching hardware, since the four byte-wide full-duplex links can be connected by cables to neighboring nodes in an arbitrary network topology.