On Optimal Placements of Processors in Tori Networks
Two- and three-dimensional k-tori are among the most widely used topologies in the design of new parallel computers. Traditionally (with the exception of the Tera parallel computer), these networks have been used as fully-populated networks, in the sense that every routing node in the topology is subject to message injection. However, fully-populated tori and meshes exhibit a theoretical throughput that degrades as the network size increases. In addition, the performance of these networks is sensitive to link faults. In contrast, multistage networks (which are partially populated) scale well with network size. We propose to add slackness to fully-populated tori by reducing the number of processors, and we study optimal fault-tolerant routing strategies for the resulting interconnections.
The key concept that we study is the average link load in an interconnection network with
a given placement and a routing algorithm, where a placement is the subset of the nodes in the
interconnection network that are attached to processors. Reducing the load on the links by the
choice of a placement and a routing algorithm leads to improvements in both the performance
and the fault tolerance of the communication system.
Our main contribution is the construction of optimal placements for 2- and 3-dimensional k-tori and their corresponding routing algorithms. These placements yield a link load that is linear in the number of processors, and they are of optimal size.
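The average-link-load metric described above can be made concrete with a small sketch. The code below is purely illustrative (it assumes dimension-order routing with shortest wrap-around paths and unit traffic between every pair of placed processors; it is not the paper's algorithm): it counts traversals of each directed link of a 2D k-torus for a given placement, here the fully-populated 4x4 torus.

```python
def link_loads(k, placement):
    """Count traversals of each directed link of a k x k torus when every
    ordered pair of placed processors exchanges one unit of traffic,
    routed dimension-order (X first, then Y) along shortest wrap paths."""
    loads = {}  # (node, neighbor) -> traffic units crossing that link

    def step(c, d):
        # direction of the shortest step from coordinate c toward d on a
        # ring of size k (+1 on ties)
        delta = (d - c) % k
        return 1 if delta <= k // 2 else -1

    for src in placement:
        for dst in placement:
            if src == dst:
                continue
            x, y = src
            while x != dst[0]:                       # route in X first
                nx = (x + step(x, dst[0])) % k
                loads[((x, y), (nx, y))] = loads.get(((x, y), (nx, y)), 0) + 1
                x = nx
            while y != dst[1]:                       # then route in Y
                ny = (y + step(y, dst[1])) % k
                loads[((x, y), (x, ny))] = loads.get(((x, y), (x, ny)), 0) + 1
                y = ny
    return loads

k = 4
full = [(x, y) for x in range(k) for y in range(k)]  # fully-populated torus
loads = link_loads(k, full)
avg = sum(loads.values()) / (4 * k * k)              # 4 directed links per node
```

Removing processors from the placement (adding slackness) reduces the total traffic crossing the links, which is exactly the effect the paper exploits.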
On resource placements and fault-tolerant broadcasting in toroidal networks
Parallel computers are classified into multiprocessors and multicomputers. A multiprocessor system usually has a shared memory through which its processors communicate. The processors of a multicomputer system, on the other hand, communicate by message passing through an interconnection network. A widely used class of interconnection networks is the toroidal networks. Compared to a hypercube, a torus has a larger diameter but better tradeoffs, such as higher channel bandwidth and lower node degree. Results on resource placements and fault-tolerant broadcasting in toroidal networks are presented. Given a limited number of resources, it is desirable to distribute them over the interconnection network so that the distance between a non-resource node and the closest resource is minimized. This problem is known as distance-d placement: each non-resource node must be within distance d of at least one resource, using the least possible number of resources. Solutions for distance-d placements in 2D and 3D tori are proposed and compared with placements used so far in practice. Simulation experiments show that the proposed solutions are superior to the placements used in practice in terms of reducing average network latency.

The complexity of a multicomputer increases the chances of processor failures, so designing fault-tolerant communication algorithms is necessary for effective utilization of such a system. Broadcasting (single-node one-to-all) is one of the important communication primitives in a multicomputer. A non-redundant fault-tolerant broadcasting algorithm for a faulty toroidal network is designed. The algorithm can tolerate up to (2n-2) processor failures. Compared to the optimal algorithm in a fault-free n-dimensional toroidal network, the proposed algorithm requires at most 3 extra communication steps using cut-through packet routing, and (n + 1) extra steps using store-and-forward routing.
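For intuition about distance-d placement, the sketch below uses a classical diagonal construction on a k x k torus with k = 2d^2 + 2d + 1, placing resources where x + (2d+1)y = 0 (mod k). This particular construction is an assumption for illustration (the abstract does not specify its placements); under it, every node lies within Lee distance d of exactly one resource, i.e. the placement is perfect.

```python
def lee_dist(a, b, k):
    """Lee (wrap-around) distance between nodes a and b of a k x k torus."""
    dx = abs(a[0] - b[0])
    dy = abs(a[1] - b[1])
    return min(dx, k - dx) + min(dy, k - dy)

def diagonal_placement(d):
    """Resources on the k x k torus, k = 2d^2 + 2d + 1, placed where
    x + (2d+1)y == 0 (mod k): a Lee-sphere tiling of the torus."""
    k = 2 * d * d + 2 * d + 1
    resources = [(x, y) for x in range(k) for y in range(k)
                 if (x + (2 * d + 1) * y) % k == 0]
    return k, resources

def is_perfect(d):
    """True iff every node is within Lee distance d of exactly one resource."""
    k, res = diagonal_placement(d)
    return all(sum(1 for r in res if lee_dist((x, y), r, k) <= d) == 1
               for x in range(k) for y in range(k))
```

Each Lee sphere of radius d contains 2d^2 + 2d + 1 nodes, so k resources tile the k x k torus exactly, which is why the placement size is the least possible.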
Optimal Interleaving on Tori
We study t-interleaving on two-dimensional tori, defined by the property that every connected subgraph with t or fewer vertices in the torus is labelled with distinct integers. It has applications in distributed data storage and burst error correction, and is closely related to Lee metric codes. We say that a torus can be perfectly t-interleaved if its t-interleaving number – the minimum number of distinct integers needed to t-interleave the torus – meets the sphere-packing lower bound. We prove necessary and sufficient conditions for tori that can be perfectly t-interleaved, and present efficient perfect t-interleaving constructions. The most important contribution of this paper is to prove that the t-interleaving number of tori that are large enough in both dimensions, which constitute by far the majority of cases, is at most one more than the sphere-packing lower bound, and to present an optimal and efficient t-interleaving scheme for them. We then prove bounds on the t-interleaving numbers for the remaining cases, completing a general picture of the t-interleaving problem on 2-dimensional tori.
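The t-interleaving condition has a clean distance characterization: two nodes at graph (Lee) distance d lie in a connected subgraph with d+1 vertices and in no smaller one, so a labelling t-interleaves the torus exactly when any two cells sharing a label are at Lee distance at least t. The sketch below (our own illustrative example, not the paper's construction) checks this criterion, using a checkerboard labelling of an even torus as a 2-interleaving; two labels are clearly optimal for t = 2, since adjacent cells must differ.

```python
def lee_dist(a, b, m, n):
    """Lee distance on an m x n torus (graph distance between grid nodes)."""
    dx = abs(a[0] - b[0])
    dy = abs(a[1] - b[1])
    return min(dx, m - dx) + min(dy, n - dy)

def is_t_interleaving(labels, t, m, n):
    """A labelling t-interleaves the m x n torus iff any two distinct cells
    sharing a label are at Lee distance >= t."""
    cells = [(x, y) for x in range(m) for y in range(n)]
    return all(lee_dist(a, b, m, n) >= t
               for i, a in enumerate(cells)
               for b in cells[i + 1:]
               if labels[a] == labels[b])

m = n = 6
# checkerboard labelling: cells with equal labels are at even distance >= 2
checkerboard = {(x, y): (x + y) % 2 for x in range(m) for y in range(n)}
ok = is_t_interleaving(checkerboard, 2, m, n)
```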
Resource placement, data rearrangement, and Hamiltonian cycles in torus networks
Many parallel machines, both commercial and experimental, have been or are being designed with toroidal interconnection networks. For a given number of nodes, the torus has a relatively larger diameter than the hypercube, but better cost/performance tradeoffs, such as higher channel bandwidth and lower node degree. Thus, the torus is becoming a popular topology for the interconnection network of high-performance parallel computers.
In a multicomputer, resources such as I/O devices or software packages are distributed over the network. The first part of the thesis investigates efficient methods of distributing resources in a torus network. Three classes of placement methods are studied: (1) the distance-t placement problem, in which any non-resource node is at a distance of at most t from some resource node; (2) the j-adjacency problem, in which a non-resource node is adjacent to at least j resource nodes; and (3) the generalized placement problem, in which a non-resource node must be at a distance of at most t from at least j resource nodes.
This resource placement technique can be applied to allocating spare processors to provide fault tolerance in the case of processor failures. Some efficient spare-processor placement methods and reconfiguration schemes for processor failures are also described.
In a torus-based parallel system, some algorithms perform best when the data are distributed to processors numbered in Cartesian order; in other cases, it is better to distribute the data to processors numbered in Gray code order. Since the placement patterns may change dynamically, it is essential to find efficient methods of rearranging the data from Gray code order to Cartesian order and vice versa. In the second part of the thesis, efficient methods for data transfer between Cartesian (radix) order and Gray code order are developed.
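The rearrangement above rests on converting between binary (Cartesian/radix) indices and Gray-code indices. A standard sketch, assuming the binary-reflected Gray code and a power-of-two processor count (the thesis's exact numbering convention may differ):

```python
def to_gray(b):
    """Binary-reflected Gray code of integer b."""
    return b ^ (b >> 1)

def from_gray(g):
    """Inverse conversion: recover the binary (Cartesian) index."""
    b = 0
    while g:
        b ^= g
        g >>= 1
    return b

def cartesian_to_gray(data):
    """Rearrange a list so the element at Cartesian index i lands at the
    Gray-code index of i (a hypothetical convention for illustration).
    Requires len(data) to be a power of two so to_gray is a bijection."""
    out = [None] * len(data)
    for i, v in enumerate(data):
        out[to_gray(i)] = v
    return out
```

Since both conversions are O(number of bits) per index, a full rearrangement of N elements costs O(N log N) bit operations before any communication is counted; the thesis's contribution is the efficient data-transfer schedule itself, which this sketch does not model.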
The last part of the thesis gives results on generating edge-disjoint Hamiltonian cycles in k-ary n-cubes, hypercubes, and 2D tori. These edge-disjoint cycles are quite useful for many communication algorithms.
Reliable low latency I/O in torus-based interconnection networks
In today's high-performance computing environment, I/O remains the main bottleneck to achieving the performance expected of ever-improving processor and memory technologies. Interconnection networks therefore combine processing units, system I/O, and a high-speed switch fabric into a new paradigm of I/O-based networking. This approach decouples the system into computational and I/O interconnections, each allowing "any-to-any" communication among processors and I/O devices, unlike the shared model of a bus architecture. The computational interconnection, a network of processing units (compute-nodes), is used for inter-processor communication in carrying out computation tasks, while the I/O interconnection manages the transfer of I/O requests between the compute-nodes and the I/O or storage media through dedicated I/O processing units (I/O-nodes). Given the special functions performed by the I/O-nodes, their placement and reliability become important issues in improving the overall performance of the interconnection system.
This thesis focuses on the design and topological placement of I/O-nodes in torus-based interconnection networks, with the aim of reducing I/O communication latency between compute-nodes and I/O-nodes even in the presence of faulty I/O-nodes. We propose an efficient and scalable relaxed quasi-perfect placement scheme, based on Lee distance error-correcting codes, such that every compute-node is at distance t, or at most distance t+1, from an I/O-node for a given t. This scheme provides a better, optimal alternative to quasi-perfect placement when a perfect placement cannot be found for a particular torus. Furthermore, when I/O-nodes fail, the placement scheme is also used to determine alternative I/O-nodes for rerouting I/O traffic from the affected compute-nodes with minimal slowdown. To guarantee the quality of service required of inter-processor communication, a scheduling algorithm is developed at the router level that prioritizes message forwarding by message class, with inter-process messages given higher priority than I/O messages.
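The router-level scheduling just described can be modelled as a two-class priority queue: inter-process messages are always forwarded before any pending I/O message. The sketch below is an illustrative model of that policy, not the thesis's router design (class names and message formats are our own).

```python
from collections import deque

class PriorityRouter:
    """Two-class output scheduler: inter-process (high) before I/O (low)."""

    def __init__(self):
        self.high = deque()   # inter-process messages
        self.low = deque()    # I/O messages

    def enqueue(self, msg, is_io):
        (self.low if is_io else self.high).append(msg)

    def forward(self):
        """Return the next message to place on the output link, or None."""
        if self.high:
            return self.high.popleft()
        if self.low:
            return self.low.popleft()
        return None

r = PriorityRouter()
r.enqueue("io-1", is_io=True)
r.enqueue("ipc-1", is_io=False)
r.enqueue("io-2", is_io=True)
order = [r.forward() for _ in range(3)]   # inter-process message leaves first
```

Strict priority like this can starve I/O traffic under heavy inter-process load; the simulation results below suggest the degradation observed in practice was small.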
Our simulation results show that relaxed quasi-perfect placement outperforms both quasi-perfect placement and the conventional I/O placement (where I/O-nodes are concentrated at the base of the torus interconnection), with little degradation in inter-process communication performance. The fault-tolerant redirection scheme also incurs minimal slowdown, especially when the number of faulty I/O-nodes is less than half of the initially available I/O-nodes.
Optimizing Communication for Massively Parallel Processing
The current trends in high-performance computing show that large machines with tens of thousands of processors will soon be readily available. The IBM Blue Gene/L machine with 128K processors (currently being deployed) is an important step in this direction. In this scenario, manually scaling applications becomes a significant burden for the programmer. This task of scaling involves addressing issues like load imbalance and communication overhead. In this thesis, we explore several communication optimizations to help parallel applications scale easily to a large number of processors. We also present automatic runtime techniques to relieve the programmer of the burden of optimizing communication in their applications.
This thesis explores processor virtualization to improve communication performance in applications. With processor virtualization, the computation is mapped to virtual processors (VPs). After one VP has finished computation and is waiting for responses to its messages, another VP can compute, thus overlapping communication with computation. This overlap is only effective if the processor overhead of the communication operation is a small fraction of the total communication time. Fortunately, with network interfaces having co-processors, this happens to be true and processor virtualization has a natural advantage on such interconnects.
The communication optimizations we present in this thesis, are motivated by applications such as NAMD (a classical molecular dynamics application) and CPAIMD (a quantum chemistry application). Applications like NAMD and CPAIMD consume a fair share of the time available on supercomputers. So, improving their performance would be of great value. We have successfully scaled NAMD to 1TF of peak performance on 3000 processors of PSC Lemieux, using the techniques presented in this thesis.
We study both point-to-point communication and collective communication (specifically all-to-all communication). On a large number of processors, all-to-all communication can take several milliseconds to finish. With the synchronous collectives defined in MPI, the processor idles while the collective messages are in flight. We therefore demonstrate an asynchronous collective communication framework that lets the CPU compute while all-to-all messages are in flight. We also show that the best strategy for all-to-all communication depends on the message size, the number of processors, and other dynamic parameters. This suggests that these parameters can be observed at runtime and used to choose the optimal strategy. In this thesis, we demonstrate adaptive strategy switching for all-to-all communication.
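Strategy switching of this kind is often reasoned about with a latency-bandwidth (alpha-beta) cost model; the sketch below is a hypothetical model of our own, not the thesis's runtime framework. Direct pairwise exchange sends p-1 messages per processor, while a Bruck-style combining scheme uses about log2(p) rounds of larger combined messages, so small messages favour combining and large messages favour direct exchange.

```python
import math

def all_to_all_cost(p, m, alpha, beta):
    """Estimated time of two classic all-to-all strategies on p processors:
    m bytes per destination, per-message start-up alpha, per-byte cost beta."""
    direct = (p - 1) * (alpha + m * beta)          # one message per peer
    rounds = math.ceil(math.log2(p))
    bruck = rounds * (alpha + (p / 2) * m * beta)  # fewer, larger messages
    return {"direct": direct, "bruck": bruck}

def choose_strategy(p, m, alpha=10e-6, beta=1e-9):
    """Pick the cheaper strategy under the model (illustrative constants)."""
    costs = all_to_all_cost(p, m, alpha, beta)
    return min(costs, key=costs.get)
```

A runtime system can measure alpha, beta, p, and m on the fly and re-evaluate this choice as conditions change, which is the essence of adaptive strategy switching.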
The communication optimization framework presented in this thesis has been designed to optimize communication in the context of processor virtualization and dynamically migrating objects. We present the streaming strategy to optimize fine-grained object-to-object communication.
In this thesis, we motivate the need for hardware collectives, as processor-based collectives can be delayed by intermediate processors that are busy with computation. We explore a next-generation interconnect that supports collectives in the switching hardware, and we show the performance gains of hardware collectives through synthetic benchmarks.