114 research outputs found

    A new-generation class of parallel architectures and their performance evaluation

    The development of computers with hundreds or thousands of processors and capability for very high performance is absolutely essential for many computation problems, such as weather modeling, fluid dynamics, and aerodynamics. Several interconnection networks have been proposed for parallel computers. Nevertheless, the majority of them are plagued by rather poor topological properties that result in large memory latencies for DSM (Distributed Shared-Memory) computers. On the other hand, scalable networks with very good topological properties are often impossible to build because of their prohibitively high VLSI (e.g., wiring) complexity. Such a network is the generalized hypercube (GH). The GH supports full-connectivity of its nodes in each dimension and is characterized by outstanding topological properties. In addition, low-dimensional GHs have very large bisection widths. We propose in this dissertation a new class of processor interconnections, namely HOWs (Highly Overlapping Windows), that are more generic than the GH, are highly scalable, and have comparable performance. We analyze the communications capabilities of 2-D HOW systems and demonstrate that in practical cases HOW systems perform much better than binary hypercubes for important communications patterns. These properties are in addition to the good scalability and low hardware complexity of HOW systems. We present algorithms for one-to-one, one-to-all broadcasting, all-to-all broadcasting, one-to-all personalized, and all-to-all personalized communications on HOW systems. These algorithms are developed and evaluated for several communication models. In addition, we develop techniques for the efficient embedding of popular topologies, such as the ring, the torus, and the hypercube, into 1-D and 2-D HOW systems. The objective is to show that 2-D HOW systems are not only scalable and easy to implement, but they also result in good embedding of several classical topologies

    CLEX: Yet Another Supercomputer Architecture?

    We propose the CLEX supercomputer topology and routing scheme. We prove that CLEX can utilize a constant fraction of the total bandwidth for point-to-point communication, at delays proportional to the sum of the number of intermediate hops and the maximum physical distance between any two nodes. Moreover, by applying an asymmetric bandwidth assignment to the links, all-to-all communication can be realized (1+o(1))-optimally with regard to both bandwidth and delays. This is achieved at node degrees of n^ε, for an arbitrarily small constant ε ∈ (0,1]. In contrast, these results are impossible in any network featuring constant or polylogarithmic node degrees. Through simulation, we assess the benefits of an implementation of the proposed communication strategy. Our results indicate that, for a million processors, CLEX can increase bandwidth utilization and reduce average routing path length by factors of at least 10 and 5, respectively, in comparison to a torus network. Furthermore, the CLEX communication scheme features several other properties, such as deadlock-freedom, inherent fault-tolerance, and canonical partition into smaller subsystems

    I/O embedding and broadcasting in star interconnection networks

    The issues of communication between a host or central controller and processors in large interconnection networks are very important and have been studied in the past by several researchers. There is a plethora of problems that arise when processors are asked to exchange information on parallel computers on which processors are interconnected according to a specific topology. In robust networks, it is desirable at times to send (receive) data/control information to (from) all the processors in minimal time. This type of communication is commonly referred to as broadcasting. To speed up broadcasting in a given network without modifying its topology, certain processors called stations can be specified to act as relay agents. In this thesis, broadcasting issues in a star-based interconnection network are studied. The model adopted assumes all-port communication and a wormhole switching mechanism. Initially, the problem treated is one of finding the minimum number of stations required to cover all the nodes in the star graph with i-adjacency. We consider 1-, 2-, and 3-adjacencies and determine the upper bound on the number of stations required to cover the nodes for each case. After deriving the number of stations, two algorithms are designed to broadcast the messages first from the host to stations, and then from stations to the remaining nodes. In addition, a Binary-based Algorithm is designed to allow routing in the network by working directly on the binary labels assigned to the star graph. No look-up table is consulted during routing, and a minimum number of bits is used to represent a node label. At the end, the thesis sheds light on another algorithm for routing using parallel paths in the star network

    Work-preserving real-time emulation of meshes on butterfly networks

    The emulation of a guest network G on a host network H is work-preserving and real-time if the inefficiency, that is the ratio W_G/W_H of the amounts of work done in both networks, and the slowdown of the emulation are O(1). In this thesis we show that an infinite number of meshes can be emulated on a butterfly in a work-preserving real-time manner, despite the fact that any emulation of an s × s-node mesh in a butterfly with load 1 has a dilation of Ω(log s). The recursive embedding of a mesh in a butterfly presented by Koch et al. (STOC 1989), which forms the basis for our work, is corrected and generalized by relaxing unnecessary constraints. An algorithm determining the parameter for each stage of the recursion is described and a rigorous analysis of the resulting emulation shows that it is work-preserving and real-time for an infinite number of meshes. Data obtained from simulated embeddings suggests possible improvements to achieve a truly work-preserving emulation of the class of meshes on the class of butterflies

    Optimal Permutation Routing for Low-dimensional Hypercubes

    We consider the offline problem of routing a permutation of tokens on the nodes of a d-dimensional hypercube, under a queueless MIMD communication model (under the constraints that each hypercube edge may only communicate one token per communication step, and each node may only be occupied by a single token between communication steps). For a d-dimensional hypercube, it is easy to see that d communication steps are necessary. We develop a theory of “separability” which enables an analytical proof that d steps suffice for the case d = 3, and facilitates an experimental verification that d steps suffice for d = 4. This result improves the upper bound for the number of communication steps required to route an arbitrary permutation on arbitrarily large hypercubes to 2d − 4. We also find an interesting side-result, that the number of possible communication steps in a d-dimensional hypercube is the same as the number of perfect matchings in a (d + 1)-dimensional hypercube, a combinatorial quantity for which there is no closed-form expression. Finally we present some experimental observations which may lead to a proof of a more general result for arbitrarily large dimension d.
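
The necessity of d steps noted in the abstract above can be illustrated with a small sketch. It is not taken from the paper: the helper names `hamming` and `routing_lower_bound` are hypothetical, and the code only computes the lower bound implied by the one-edge-per-step rule, not an actual routing schedule.

```python
def hamming(u: int, v: int) -> int:
    """Shortest-path length between hypercube nodes u and v (bits that differ)."""
    return bin(u ^ v).count("1")


def routing_lower_bound(perm: list[int]) -> int:
    """Lower bound on communication steps for routing perm.

    A token crosses at most one edge per step, so the token with the
    longest shortest path fixes a minimum number of steps.
    """
    return max(hamming(src, dst) for src, dst in enumerate(perm))


# Hypothetical example: d = 3, every token is sent to the antipodal node
# (all d address bits flipped), which is at Hamming distance d.
d = 3
antipodal = [node ^ ((1 << d) - 1) for node in range(1 << d)]
bound = routing_lower_bound(antipodal)
print(bound)
```

For the antipodal permutation the bound equals d, which matches the claim that d steps are necessary in the worst case.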

    Comparative Analysis of Hill Climbing Mapping Algorithms

    The performance of a parallel algorithm depends in part on how well the communication structure of the algorithm is matched to the communication structure of the target parallel system. The mapping problem is the problem of generating such a match algorithmically. Solving the mapping problem optimally for any non-trivial case is NP-complete. Therefore, a heuristic approach must be used to solve the problem. Although several heuristic algorithms for this problem have been developed, their performance has been evaluated on relatively few combinations of communication and processor structures. This paper extensively evaluates the performance of hill climbing mapping algorithms through simulation on communication structures representative of existing parallel algorithms and architectures. The motivations for our study are as follows: to establish the differences in performance between variations of the hill climbing heuristic; to determine the factors which affect the performance of hill climbing with respect to the optimum; and to compare hill climbing to known optimum and non-optimum mappings to determine the effectiveness of hill climbing as a mapping heuristic
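
As a rough illustration of what a hill climbing mapping heuristic can look like, here is a minimal sketch. It is an assumption-laden example, not one of the variants evaluated in the paper: the cost function (total hop distance over communicating task pairs), the pairwise-swap neighbourhood, and the names `cost` and `hill_climb` are all chosen here for illustration.

```python
import itertools


def cost(mapping, task_edges, dist):
    """Total hop distance over all communicating task pairs (assumed metric)."""
    return sum(dist[mapping[a]][mapping[b]] for a, b in task_edges)


def hill_climb(mapping, task_edges, dist):
    """Swap the processors of two tasks whenever that lowers the cost.

    Stops at a local optimum: a full pass over all pairs without improvement.
    """
    mapping = list(mapping)
    best = cost(mapping, task_edges, dist)
    improved = True
    while improved:
        improved = False
        for i, j in itertools.combinations(range(len(mapping)), 2):
            mapping[i], mapping[j] = mapping[j], mapping[i]
            candidate = cost(mapping, task_edges, dist)
            if candidate < best:
                best = candidate
                improved = True
            else:
                # Undo the swap: it did not improve the mapping.
                mapping[i], mapping[j] = mapping[j], mapping[i]
    return mapping, best


# Hypothetical instance: a ring of 4 tasks mapped onto a 4-processor ring.
n = 4
dist = [[min((i - j) % n, (j - i) % n) for j in range(n)] for i in range(n)]
task_edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
start = [0, 2, 1, 3]  # mapping[task] = processor
final_mapping, final_cost = hill_climb(start, task_edges, dist)
print(cost(start, task_edges, dist), final_cost, final_mapping)
```

Because the search only accepts improving swaps, the result depends on the starting mapping and may be a local rather than a global optimum, which is the kind of behaviour the study sets out to measure.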

    Algorithms for Mapping Parallel Processes onto Grid and Torus Architectures

    Static mapping is the assignment of parallel processes to the processing elements (PEs) of a parallel system, where the assignment does not change during the application's lifetime. In our scenario we model an application's computations and their dependencies by an application graph. This graph is first partitioned into (nearly) equally sized blocks. These blocks need to communicate at block boundaries. To assign the processes to PEs, our goal is to compute a communication-efficient bijective mapping between the blocks and the PEs. This approach of partitioning followed by bijective mapping has many degrees of freedom. Thus, users and developers of parallel applications need to know more about which choices work for which application graphs and which parallel architectures. To this end, we not only develop new mapping algorithms (derived from known greedy methods) but also perform extensive experiments involving different classes of application graphs (meshes and complex networks), architectures of parallel computers (grids and tori), as well as different partitioners and mapping algorithms. Surprisingly, the quality of the partitions, unless very poor, has little influence on the quality of the mapping. More importantly, one of our new mapping algorithms always yields the best results in terms of the quality measure maximum congestion when the application graphs are complex networks. In the case of meshes as application graphs, this mapping algorithm always leads in terms of maximum congestion and maximum dilation, another common quality measure. Comment: Accepted at PDP-201
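
The maximum dilation measure mentioned in the abstract above can be sketched as follows. This is an illustrative example under stated assumptions, not the paper's code: dilation is taken as the shortest-path hop distance between the PEs of two communicating blocks, the helper names are hypothetical, and maximum congestion is left out because it additionally depends on a routing scheme.

```python
def grid_distance(p, q):
    """Hop distance between PEs p and q on a 2-D grid (no wrap-around)."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def torus_distance(p, q, rows, cols):
    """Hop distance on a 2-D torus, where each dimension wraps around."""
    dr = abs(p[0] - q[0])
    dc = abs(p[1] - q[1])
    return min(dr, rows - dr) + min(dc, cols - dc)


def max_dilation(block_edges, placement, distance):
    """Largest distance between the PEs of any two communicating blocks."""
    return max(distance(placement[a], placement[b]) for a, b in block_edges)


# Hypothetical 4 x 4 machine; block b is placed row by row on PE (b // 4, b % 4).
rows, cols = 4, 4
placement = {b: divmod(b, cols) for b in range(rows * cols)}
block_edges = [(0, 3), (0, 12), (5, 6)]  # pairs of blocks that communicate

grid_dil = max_dilation(block_edges, placement, grid_distance)
torus_dil = max_dilation(
    block_edges, placement, lambda p, q: torus_distance(p, q, rows, cols)
)
print(grid_dil, torus_dil)
```

The same placement gives a smaller maximum dilation on the torus than on the grid here, since the wrap-around links shorten the paths between blocks placed on opposite borders.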

    Parallel Computation on Hypercube-Like Machines.

    The hypercube interconnection network has been recognized to be very suitable for a parallel computing architecture due to its attractive topological properties. Recently, several modified hypercubes have been proposed to improve the performance of a hypercube. This dissertation deals with two modified hypercubes, the X-hypercube and the Z-cube. The X-hypercube is a variant of the hypercube, with the same amount of hardware but a diameter of only ⌈(n + 1)/2⌉ in a hypercube of dimension n. The Z-cube has only 75 percent of the edges of a hypercube with the same number of vertices and the same diameter as the hypercube. In this dissertation, we investigate some topological properties and the effectiveness of the X-hypercube and the Z-cube in their combinatorial and computational aspects. We give optimal or nearly optimal data communication algorithms, including routing, broadcasting, and census function, for the X-hypercube and the Z-cube. We also give optimal embedding algorithms between the X-hypercube and the hypercube. It is shown that the average distance between vertices in an X-hypercube is roughly 13/16 of that in a hypercube. This implies that an X-hypercube achieves better average communication performance than a hypercube. In addition, a set of fundamental SIMD algorithms for an X-hypercube is given. Our results indicate that the X-hypercube improves performance over the hypercube, but not by as much as the reduction in diameter, and that the Z-cube is a good alternative to the hypercube when VLSI implementation is of major concern