22 research outputs found

Interconnection Networks Embeddings and Efficient Parallel Computations.

Author: Abuelrub Emadeddin Mohamed
Publication venue: LSU Digital Commons
Publication date: 01/01/1993

To obtain greater performance, many processors are allowed to cooperate to solve a single problem. These processors communicate via an interconnection network or a bus. The most essential function of the underlying interconnection network is the efficient interchange of messages between processes in different processors. Parallel machines based on the hypercube topology have gained great respect in parallel computation because of their many attractive properties. Many versions of the hypercube have been introduced by researchers, mainly to enhance communications. The twisted hypercube is one of the most attractive versions of the hypercube. It preserves the important features of the hypercube and reduces its diameter by a factor of two. This dissertation investigates relations and transformations between various interconnection networks and the twisted hypercube and explores its efficiency in parallel computation. The capability of the twisted hypercube to simulate complete binary trees, complete quad trees, and rings is demonstrated and compared with that of the hypercube. Finally, the fault tolerance of the twisted hypercube is investigated. We present optimal algorithms to simulate rings in a faulty twisted hypercube environment and compare them with those for the hypercube.

Louisiana State University
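
The ring simulation mentioned in the abstract above can be illustrated on the ordinary hypercube, where the reflected binary Gray code gives a well-known ring embedding. The sketch below is only that standard construction; it is not the dissertation's algorithm for the twisted hypercube, and the function names are hypothetical.

```python
def gray_code(i: int) -> int:
    """Return the i-th word of the reflected binary Gray code."""
    return i ^ (i >> 1)


def ring_in_hypercube(n: int) -> list[int]:
    """List the 2**n hypercube node labels in ring order (Gray-code order)."""
    return [gray_code(i) for i in range(2 ** n)]


def is_hypercube_edge(u: int, v: int) -> bool:
    """Two node labels are adjacent when they differ in exactly one bit."""
    diff = u ^ v
    return diff != 0 and diff & (diff - 1) == 0


def is_valid_ring(ring: list[int]) -> bool:
    """Check that every ring edge, including the wrap-around, is a hypercube edge."""
    return all(
        is_hypercube_edge(ring[i], ring[(i + 1) % len(ring)])
        for i in range(len(ring))
    )


ring = ring_in_hypercube(3)
print(ring, is_valid_ring(ring))
```

Because consecutive Gray-code words differ in one bit, each ring edge maps to a single hypercube link, so the embedding has dilation 1.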

An improved fault mitigation strategy for CUDA Fermi GPUs

Author: Di Carlo S., Gambardella G., Martella I., Prinetto P., Rolfo D., Trotta P.
Publication date: 01/01/2014

High computational power is a predominant requirement in many applications. In this field, Graphics Processing Units (GPUs) are increasingly adopted. Low prices and high parallelism make GPUs attractive, even in safety-critical applications. Nonetheless, new methodologies must be studied and developed to increase the dependability of GPUs. This paper presents an improved fault mitigation strategy against permanent faults for CUDA Fermi GPUs. The proposed approach exploits the reverse engineering of the block scheduling policy in CUDA Fermi GPUs in order to minimize the fault mitigation timing overhead. The graceful performance degradation achieved by the proposed technique outperforms multithreaded CPU implementations and other fault mitigation strategies for CUDA GPUs, even in the presence of multiple permanent faults.

PORTO@iris (Publications Open Repository TOrino - Politecnico di Torino)

PORTO Publications Open Repository TOrino

Hypercube-Based Topologies With Incremental Link Redundancy.

Author: Latifi Shahram
Publication venue: LSU Digital Commons
Publication date: 01/01/1989

Hypercube structures have received a great deal of attention due to the attractive properties inherent to their topology. Parallel algorithms targeted at this topology can be partitioned into many tasks, each of which runs on one node processor. A high degree of performance is achievable by running every task individually and concurrently on each node processor available in the hypercube. Nevertheless, the performance can be greatly degraded if the node processors spend much time just communicating with one another. The goal in designing hypercubes is, therefore, to achieve a high ratio of computation time to communication time. The dissertation primarily addresses ways to enhance system performance by minimizing the communication time among processors. The need for improving the performance of hypercube networks is clearly explained. Three novel topologies related to hypercubes with improved performance are proposed and analyzed. Firstly, the Bridged Hypercube (BHC) is introduced. It is shown that this design is remarkably more efficient and cost-effective than the standard hypercube due to its low diameter. Basic routing algorithms such as one-to-one and broadcasting are developed for the BHC and proven optimal. Shortcomings of the BHC such as its asymmetry and limited application are clearly discussed. The Folded Hypercube (FHC), a symmetric network with low diameter and low node degree, is introduced. This new topology is shown to support highly efficient communications among the processors. For the FHC, optimal routing algorithms are developed and proven to be remarkably more efficient than those of the conventional hypercube. For both BHC and FHC, network parameters such as average distance, message traffic density, and communication delay are derived and comparatively analyzed. Lastly, to enhance the fault tolerance of the hypercube, a new design called Fault Tolerant Hypercube (FTH) is proposed. The FTH is shown to exhibit a graceful degradation in performance in the presence of faults. Probabilistic models based on Markov chains are employed to characterize the fault tolerance of the FTH. The results are verified by Monte Carlo simulation. The most attractive feature of all the new topologies is the asymptotically zero overhead associated with them. The designs are simple and implementable. These designs lend themselves to many parallel processing applications requiring a high degree of performance.

Louisiana State University
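
The low diameter claimed for the Folded Hypercube in the abstract above can be checked numerically. The sketch below assumes the definition commonly given in the literature (an n-dimensional hypercube plus one extra link from each node to its bitwise complement); the abstract itself does not spell out the construction, and the function names are hypothetical.

```python
from collections import deque


def neighbors(node: int, n: int, folded: bool) -> list[int]:
    """Neighbors of a node in the n-cube, plus the complement link if folded."""
    result = [node ^ (1 << bit) for bit in range(n)]
    if folded:
        # Assumed folded-hypercube link: node <-> bitwise complement.
        result.append(node ^ ((1 << n) - 1))
    return result


def diameter(n: int, folded: bool) -> int:
    """Diameter via breadth-first search from node 0.

    Both networks are vertex symmetric, so the eccentricity of one node
    equals the diameter.
    """
    dist = {0: 0}
    queue = deque([0])
    while queue:
        u = queue.popleft()
        for v in neighbors(u, n, folded):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return max(dist.values())


for n in range(2, 7):
    print(n, diameter(n, folded=False), diameter(n, folded=True))
```

Under this assumed definition the search reports a diameter of n for the hypercube and roughly half that for the folded variant, at the cost of one extra link per node.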

Mapping Signal Processing Algorithms on Parallel Architectures

Author: Sammur Nidal M.
Publication venue: Oklahoma State University Library
Publication date: 01/07/1992
Field of study

Electrical Engineering

SHAREOK repository

Performance effects of node mapping on the IBM BlueGene/L machine

Author: Smith Brian Edward
Publication venue: Iowa State University Digital Repository
Publication date: 01/01/2005

The IBM BlueGene/L (BG/L) supercomputer is a new machine consisting of up to 65536 relatively modest compute nodes connected with three application-level networks -- a high-performance point-to-point 3D torus network, a global combining/broadcast tree network for collective operations, and a global interrupt/barrier network for extremely fast global barriers. The BG/L control system allows the user to assign MPI logical ranks to physical torus coordinates at run-time in an arbitrary manner as long as all nodes are uniquely included in the mapping. This presents the possibility of increasing application performance with very little effort. This thesis investigates the performance effects of node mapping with several benchmarks and scientific codes using a variety of existing and new mapping strategies. The benchmarks are the NAS parallel benchmarks, the Ames Laboratory Classical Molecular Dynamics code (ALCMD), and the General Atomic and Molecular Electronic Structure System (GAMESS) application. The NAS benchmarks are short, easy to understand, and fairly well known. ALCMD has an interesting communication pattern that should benefit from a good mapping strategy. GAMESS is one application that is not necessarily well suited for running on BlueGene because it requires a large amount of compute power and memory per node. However, it provides an interesting data point for the performance of applications that were not designed for a particular system and the possible benefits of mapping on such applications. The mappings investigated were the stock permutations (XYZ, XZY, etc.), Gray-code based mesh mappings, random maps, variations on Gray-code maps for embedding 2D meshes in the 3D torus, and three maps designed for GAMESS. Performance results are presented for node mappings on several BG/L partition sizes.

Digital Repository @ Iowa State University (ISU)
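
The idea of assigning logical ranks to torus coordinates, described in the abstract above, can be sketched as follows. This is a toy model, not the BG/L control system interface: the partition size is hypothetical, and the convention that the first axis named in a permutation such as XYZ varies fastest is an assumption made for this sketch.

```python
def rank_to_coords(rank: int, dims: dict[str, int], order: str) -> dict[str, int]:
    """Map a logical rank to torus coordinates.

    Assumption: the first axis in `order` varies fastest.
    """
    coords = {}
    for axis in order:
        coords[axis] = rank % dims[axis]
        rank //= dims[axis]
    return coords


def torus_hops(a: dict[str, int], b: dict[str, int], dims: dict[str, int]) -> int:
    """Shortest hop count between two nodes, using wrap-around links."""
    total = 0
    for axis, size in dims.items():
        d = abs(a[axis] - b[axis])
        total += min(d, size - d)
    return total


def average_neighbor_hops(dims: dict[str, int], order: str) -> float:
    """Mean hop count between logically adjacent ranks r and r + 1."""
    count = 1
    for size in dims.values():
        count *= size
    hops = [
        torus_hops(
            rank_to_coords(r, dims, order),
            rank_to_coords(r + 1, dims, order),
            dims,
        )
        for r in range(count - 1)
    ]
    return sum(hops) / len(hops)


dims = {"X": 4, "Y": 4, "Z": 2}  # hypothetical partition shape
for order in ("XYZ", "XZY", "ZYX"):
    print(order, average_neighbor_hops(dims, order))
```

Comparing the averages for different permutations shows why a mapping can matter: the same nearest-neighbor communication pattern in rank space travels a different number of physical hops depending on how ranks are laid out.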

The application of optimal transputer architecture to concurrent processing in the implementation of vision processing algorithms

Author: Bennett Ian
Publication date: 01/05/1989

University of South Wales Research Explorer

Combinatorial Design and Analysis of Optimal Multiple Bus Systems for Parallel Algorithms.

Author: Kulasinghe Priyalal D
Publication venue: LSU Digital Commons
Publication date: 01/01/1995

This dissertation develops a formal and systematic methodology for designing optimal, synchronous multiple bus systems (MBSs) realizing given (classes of) parallel algorithms. Our approach utilizes graph and group theoretic concepts to develop the necessary model and procedural tools. By partitioning the vertex set of the graphical representation CFG of the algorithm, we extract a set of interconnection functions that represents the interprocessor communication requirement of the algorithm. We prove that the optimal partitioning problem is NP-Hard. However, we show how to obtain polynomial time solutions by exploiting certain regularities present in many well-behaved parallel algorithms. The extracted set of interconnection functions is represented by an edge colored, directed graph called interconnection function graph (IFG). We show that the problem of constructing an optimal MBS to realize an IFG is NP-Hard. We show important special cases where polynomial time solutions exist. In particular, we prove that polynomial time solutions exist when the IFG is vertex symmetric. This is the case of interest for the vast majority of important interconnection function sets, whether extracted from algorithms or correspond to existing interconnection networks. We show that an IFG is vertex symmetric if and only if it is the Cayley color graph of a finite group $\Gamma$ and its generating set $\Delta$. Using this property, we present a particular scheme to construct a symmetric MBS $M(\Gamma,\Delta)$ with minimum number of buses as well as minimum number of interfaces realizing a vertex symmetric IFG. We demonstrate several advantages of the optimal MBS $M(\Gamma,\Delta)$ in terms of its symmetry, number of ports per processor, number of neighbors per processor, and the diameter. We also investigate the fault tolerant capabilities and performance degradation of $M(\Gamma,\Delta)$ in the case of a single bus failure, single driver failure, single receiver failure, and single processor failure. Further, we address the problem of designing an optimal MBS realizing a class of algorithms when the number of buses and/or processors in the target MBS are specified. The optimality criteria are maximizing the speed and minimizing the number of interfaces.

Louisiana State University
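
The Cayley color graph named in the abstract above can be made concrete with a small example. The sketch below builds the graph for a hypothetical choice of group and generating set (the cyclic group of order 8 under addition, with generators 1 and 3); it illustrates only the standard graph definition and does not attempt the dissertation's MBS construction.

```python
from typing import Callable, Iterable


def cayley_color_graph(
    elements: Iterable[int],
    op: Callable[[int, int], int],
    generators: Iterable[int],
) -> list[tuple[int, int, int]]:
    """Return directed edges (g, g*s, s); the edge g -> g*s carries color s."""
    gens = list(generators)
    return [(g, op(g, s), s) for g in elements for s in gens]


def color_class(edges: list[tuple[int, int, int]], color: int) -> dict[int, int]:
    """The map g -> g*s defined by all edges of one color."""
    return {src: dst for src, dst, c in edges if c == color}


n = 8  # hypothetical example: cyclic group Z_8


def add_mod_n(a: int, b: int) -> int:
    """Group operation of Z_8: addition modulo 8."""
    return (a + b) % n


edges = cayley_color_graph(range(n), add_mod_n, [1, 3])

for s in (1, 3):
    mapping = color_class(edges, s)
    # Each color class is a permutation of the group elements, which is why
    # a color can be read as one interconnection function.
    print(s, mapping, sorted(mapping.values()) == list(range(n)))
```

Adding any fixed element to every vertex label maps each colored edge to another edge of the same color, which is the vertex symmetry the abstract refers to.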

Investigation into the wafer-scale integration of fine-grain parallel processing computer systems

Author: Jones Simon Richard
Publication venue: Brunel University
Publication date: 01/01/1986

This thesis was submitted for the degree of Doctor of Philosophy and awarded by Brunel University. This thesis investigates the potential of wafer-scale integration (WSI) for the implementation of low-cost fine-grain parallel processing computer systems. As WSI is a relatively new subject, there was little work on which to base investigations. Indeed, most WSI architectures existed only as untried and sometimes vague proposals. Accordingly, the research strategy approached this problem by identifying a representative WSI structure and architecture on which to base investigations. An analysis of architectural proposals identified associative memory as a general-purpose parallel processing component used in a wide range of WSI architectures. Furthermore, this analysis provided a set of WSI-level design requirements to evaluate the sustainability of different architectures as research vehicles. The WSI-ASP (WASP) device, which has a large associative memory as its main component, is shown to meet these requirements and hence was chosen as the research vehicle. Consequently, this thesis addresses WSI potential through an in-depth investigation into the feasibility of implementing a large associative memory for the WASP device that meets the demanding technological constraints of WSI. Overall, the thesis concludes that WSI offers significant potential for the implementation of low-cost fine-grain parallel processing computer systems. However, due to the dual constraints of thermal management and the area required for the power distribution network, power density is a major design constraint in WSI. Indeed, it is shown that WSI power densities need to be an order of magnitude lower than VLSI power densities. The thesis demonstrates that, for associative memories at least, VLSI designs are unsuited to implementation in WSI. Rather, it is shown that WSI circuits must be closely matched to the operational environment to assure suitable power densities. These circuits are significantly larger than their VLSI equivalents. Nonetheless, the thesis demonstrates that by concentrating on the most power-intensive circuits, it is possible to achieve acceptable power densities with only a modest increase in area overheads.

Brunel University Research Archive

Parallel and Distributed Computing

Publication venue: IntechOpen
Publication date: 20/04/2021

The 14 chapters presented in this book cover a wide variety of representative works ranging from hardware design to application development. In particular, the topics addressed are programmable and reconfigurable devices and systems, dependability of GPUs (graphics processing units), network topologies, cache coherence protocols, resource allocation, scheduling algorithms, peer-to-peer networks, large-scale network simulation, and parallel routines and algorithms. In this way, the articles included in this book constitute an excellent reference for engineers and researchers who have particular interests in each of these topics in parallel and distributed computing.

Directory of Open Access Books (DOAB)