
    On Optimal Placements of Processors in Tori Networks

    Two- and three-dimensional k-tori are among the most used topologies in the design of new parallel computers. Traditionally (with the exception of the Tera parallel computer), these networks have been used as fully-populated networks, in the sense that every routing node in the topology is subjected to message injection. However, fully-populated tori and meshes exhibit a theoretical throughput that degrades as the network size increases. In addition, the performance of those networks is sensitive to link faults. In contrast, multistage networks (which are partially populated) scale well with the network size. We propose to add slackness in fully-populated tori by reducing the number of processors, and we study optimal fault-tolerant routing strategies for the resulting interconnections. The key concept that we study is the average link load in an interconnection network with a given placement and routing algorithm, where a placement is the subset of the nodes in the interconnection network that are attached to processors. Reducing the load on the links by the choice of a placement and a routing algorithm leads to improvements in both the performance and the fault tolerance of the communication system. Our main contribution is the construction of optimal placements for 2- and 3-dimensional k-tori networks and their corresponding routing algorithms. Those placements yield a link load that is linear in the number of processors and are of optimal size.
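    The link-load notion above can be made concrete with a small sketch: route one message between every ordered pair of placed nodes and count the traffic carried by each directed torus link. The sketch below is only illustrative, not the paper's optimal construction; it assumes plain dimension-order (X-then-Y) routing with minimal wrap-around, and the diagonal placement in the example is hypothetical.

    # Illustrative sketch: average link load of a placement on a 2-D k-torus,
    # assuming dimension-order routing (route along X, then along Y) with
    # minimal wrap-around hops. Placement and routing are assumptions, not
    # the constructions from the paper.
    from collections import defaultdict
    from itertools import product

    def shortest_steps(a, b, k):
        """Unit steps (+1 or -1 mod k) along one dimension, taking the shorter way."""
        fwd = (b - a) % k
        return [+1] * fwd if fwd <= k - fwd else [-1] * (k - fwd)

    def average_link_load(k, placement):
        """Average traffic over all directed links when every placed node
        sends one message to every other placed node."""
        load = defaultdict(int)
        for src, dst in product(placement, repeat=2):
            if src == dst:
                continue
            x, y = src
            for step in shortest_steps(x, dst[0], k):      # route along X first
                nxt = ((x + step) % k, y)
                load[((x, y), nxt)] += 1
                x = nxt[0]
            for step in shortest_steps(y, dst[1], k):      # then along Y
                nxt = (x, (y + step) % k)
                load[((x, y), nxt)] += 1
                y = nxt[1]
        return sum(load.values()) / (4 * k * k)            # 4 directed links per node

    if __name__ == "__main__":
        k = 8
        sparse = [(i, (2 * i) % k) for i in range(k)]      # hypothetical diagonal placement
        full = list(product(range(k), repeat=2))           # fully-populated torus
        print("sparse placement:", average_link_load(k, sparse))
        print("fully populated :", average_link_load(k, full))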

    Optimal Interleaving on Tori

    We study t-interleaving on two-dimensional tori, which is defined by the property that any connected subgraph with t or fewer vertices in the torus is labelled by all distinct integers. It has applications in distributed data storage and burst error correction, and is closely related to Lee metric codes. We say that a torus can be perfectly t-interleaved if its t-interleaving number – the minimum number of distinct integers needed to t-interleave the torus – meets the sphere-packing lower bound. We prove the necessary and sufficient conditions for tori that can be perfectly t-interleaved, and present efficient perfect t-interleaving constructions. The most important contribution of this paper is to prove that the t-interleaving numbers of tori large enough in both dimensions, which constitute by far the majority of all existing cases, are at most one more than the sphere-packing lower bound, and to present an optimal and efficient t-interleaving scheme for them. Then we prove some bounds on the t-interleaving numbers for the other cases, completing a general picture of the t-interleaving problem on 2-dimensional tori.
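    A convenient way to check a candidate labelling is the equivalent condition that any two vertices carrying the same integer must be at Lee distance at least t: a connected subgraph with at most t vertices has diameter at most t-1, so it can never contain such a pair. The sketch below applies this check; the diagonal labelling in the example is a hypothetical illustration, not one of the paper's constructions.

    # Sketch: verify that a labelling of an m x n torus is a t-interleaving,
    # using the Lee-distance condition (same label => Lee distance >= t).
    from collections import defaultdict
    from itertools import combinations

    def lee_distance(u, v, m, n):
        dx, dy = abs(u[0] - v[0]), abs(u[1] - v[1])
        return min(dx, m - dx) + min(dy, n - dy)

    def is_t_interleaving(labels, m, n, t):
        """labels maps every cell (row, col) of the m x n torus to an integer."""
        by_label = defaultdict(list)
        for cell, lab in labels.items():
            by_label[lab].append(cell)
        return all(lee_distance(u, v, m, n) >= t
                   for cells in by_label.values()
                   for u, v in combinations(cells, 2))

    if __name__ == "__main__":
        # Hypothetical example: the diagonal labelling (r + 2*c) mod 5 of a
        # 5 x 5 torus uses 5 distinct integers and passes the check for t = 2.
        labels = {(r, c): (r + 2 * c) % 5 for r in range(5) for c in range(5)}
        print(is_t_interleaving(labels, 5, 5, t=2))        # True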

    Reliable low latency I/O in torus-based interconnection networks

    In today's high performance computing environment, I/O remains the main bottleneck in achieving the optimal performance expected of the ever-improving processor and memory technologies. Interconnection networks therefore combine processing units, system I/O and a high-speed switch fabric into a new paradigm of I/O-based network. This decouples the system into computational and I/O interconnections, each allowing "any-to-any" communication among processors and I/O devices, unlike the shared model of a bus architecture. The computational interconnection, a network of processing units (compute-nodes), is used for inter-processor communication in carrying out computation tasks, while the I/O interconnection manages the transfer of I/O requests between the compute-nodes and the I/O or storage media through some dedicated I/O processing units (I/O-nodes). Considering the special functions performed by the I/O-nodes, their placement and reliability become important issues in improving the overall performance of the interconnection system. This thesis focuses on the design and topological placement of I/O-nodes in torus-based interconnection networks, with the aim of reducing I/O communication latency between compute-nodes and I/O-nodes even in the presence of faulty I/O-nodes. We propose an efficient and scalable relaxed quasi-perfect placement scheme using a Lee distance error correction code such that compute-nodes are at distance t, or at most distance t+1, from an I/O-node for a given t. This scheme provides a better and optimal alternative to quasi-perfect placement when a perfect placement cannot be found for a particular torus. Furthermore, in the occurrence of faulty I/O-nodes, the placement scheme is also used to determine alternative I/O-nodes for rerouting I/O traffic from affected compute-nodes with minimal slowdown. In order to guarantee the quality of service required of inter-processor communication, a scheduling algorithm was developed at the router level to prioritize message forwarding, with inter-process messages given higher priority than I/O messages. Our simulation results show that relaxed quasi-perfect placement outperforms quasi-perfect and conventional I/O placement (where I/O-nodes are concentrated at the base of the torus interconnection) with little degradation in inter-process communication performance. Also, the fault-tolerant redirection scheme incurs minimal slowdown, especially when the number of faulty I/O-nodes is less than half of the initially available I/O-nodes.
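    The placement property itself is easy to state as a check: every compute-node must lie within Lee distance t, or at worst t+1, of some I/O-node. The sketch below verifies that property for a given set of I/O-nodes on a k x k torus; the example placement, built from a radius-1 perfect Lee code on the 5 x 5 torus, is only a hypothetical illustration, not the thesis's placement algorithm.

    # Sketch: check the relaxed quasi-perfect property on a k x k torus, i.e.
    # every node is within Lee distance t (or at most t + 1) of an I/O-node.
    from itertools import product

    def lee_distance(u, v, k):
        dx, dy = abs(u[0] - v[0]), abs(u[1] - v[1])
        return min(dx, k - dx) + min(dy, k - dy)

    def placement_radius(k, io_nodes):
        """Largest distance from any node of the torus to its nearest I/O-node."""
        return max(min(lee_distance(node, io, k) for io in io_nodes)
                   for node in product(range(k), repeat=2))

    def is_relaxed_quasi_perfect(k, io_nodes, t):
        return placement_radius(k, io_nodes) <= t + 1

    if __name__ == "__main__":
        k, t = 5, 1
        # Hypothetical placement: the perfect Lee code {(x, y) : 2x + y = 0 mod 5},
        # which puts every node within distance t = 1 of an I/O-node.
        io_nodes = [(x, (-2 * x) % k) for x in range(k)]
        print(placement_radius(k, io_nodes))               # 1
        print(is_relaxed_quasi_perfect(k, io_nodes, t))    # True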

    Optimizing Communication for Massively Parallel Processing

    The current trends in high performance computing show that large machines with tens of thousands of processors will soon be readily available. The IBM Bluegene-L machine with 128k processors (which is currently being deployed) is an important step in this direction. In this scenario, it is going to be a significant burden for the programmer to manually scale his applications. This task of scaling involves addressing issues like load imbalance and communication overhead. In this thesis, we explore several communication optimizations to help parallel applications scale easily on a large number of processors. We also present automatic runtime techniques to relieve the programmer of the burden of optimizing communication in his applications. This thesis explores processor virtualization to improve communication performance in applications. With processor virtualization, the computation is mapped to virtual processors (VPs). After one VP has finished computation and is waiting for responses to its messages, another VP can compute, thus overlapping communication with computation. This overlap is only effective if the processor overhead of the communication operation is a small fraction of the total communication time. Fortunately, with network interfaces having co-processors, this happens to be true, and processor virtualization has a natural advantage on such interconnects. The communication optimizations we present in this thesis are motivated by applications such as NAMD (a classical molecular dynamics application) and CPAIMD (a quantum chemistry application). Applications like NAMD and CPAIMD consume a fair share of the time available on supercomputers, so improving their performance would be of great value. We have successfully scaled NAMD to 1 TF of peak performance on 3000 processors of PSC Lemieux, using the techniques presented in this thesis. We study both point-to-point communication and collective communication (specifically all-to-all communication). On a large number of processors, all-to-all communication can take several milliseconds to finish. With the synchronous collectives defined in MPI, the processor idles while the collective messages are in flight. Therefore, we demonstrate an asynchronous collective communication framework that lets the CPU compute while the all-to-all messages are in flight. We also show that the best strategy for all-to-all communication depends on the message size, the number of processors and other dynamic parameters. This suggests that these parameters can be observed at runtime and used to choose the optimal strategy for all-to-all communication. In this thesis, we demonstrate adaptive strategy switching for all-to-all communication. The communication optimization framework presented in this thesis has been designed to optimize communication in the context of processor virtualization and dynamically migrating objects. We present the streaming strategy to optimize fine-grained object-to-object communication. In this thesis, we motivate the need for hardware collectives, as processor-based collectives can be delayed by intermediate processors that are busy with computation. We explore a next-generation interconnect that supports collectives in the switching hardware. We show the performance gains of hardware collectives through synthetic benchmarks.
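    The adaptive strategy switching idea can be illustrated with a toy cost model: estimate the cost of each all-to-all strategy from the message size and the number of processors observed at runtime, and pick the cheaper one. The strategies, constants and thresholds below are assumptions made for illustration, not the thesis's measured values or its actual framework.

    # Hypothetical sketch of runtime strategy selection for all-to-all
    # communication. ALPHA/BETA and the two cost models are illustrative
    # assumptions, not measurements from the thesis.
    import math

    ALPHA = 5e-6   # assumed per-message startup cost (seconds)
    BETA = 1e-9    # assumed per-byte transfer cost (seconds)

    def cost_direct(p, msg_bytes):
        """Each processor sends p - 1 separate messages."""
        return (p - 1) * (ALPHA + BETA * msg_bytes)

    def cost_combining(p, msg_bytes):
        """Hypercube-style combining: log2(p) rounds, each forwarding ~p/2 payloads."""
        return math.ceil(math.log2(p)) * (ALPHA + BETA * msg_bytes * p / 2)

    def choose_alltoall_strategy(p, msg_bytes):
        """Observe the current parameters and pick the cheaper strategy."""
        if cost_combining(p, msg_bytes) < cost_direct(p, msg_bytes):
            return "combining"   # small messages: amortize the per-message startup cost
        return "direct"          # large messages: avoid forwarding extra bytes

    if __name__ == "__main__":
        for size in (64, 1024, 65536):
            print(size, "bytes on 1024 processors ->",
                  choose_alltoall_strategy(1024, size))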