Parallelized reliability estimation of reconfigurable computer networks

Author: Das Subhendu
Nicol David M.
Palumbo Dan

ASSURE, a parallelized system for computing the reliability of embedded avionics flight control systems that can reconfigure themselves in the event of failure, is described. ASSURE accepts a grammar that describes a reliability semi-Markov state-space. From this it creates a parallel program that simultaneously generates and analyzes the state-space, placing upper and lower bounds on the probability of system failure. ASSURE is implemented on a 32-node Intel iPSC/860, and has achieved high processor efficiencies on real problems. Through a combination of improved algorithms, exploitation of parallelism, and use of an advanced microprocessor architecture, ASSURE has reduced the execution time on substantial problems by a factor of one thousand over previous workstation implementations. Furthermore, ASSURE's parallel execution rate on the iPSC/860 is an order of magnitude faster than its serial execution rate on a Cray-2 supercomputer. While dynamic load balancing is necessary for ASSURE's good performance, it is needed only infrequently; the particular method of load balancing used does not substantially affect performance.

NASA Technical Reports Server

Reliable low latency I/O in torus-based interconnection networks

Author: Azeez Babatunde
Publication venue: Texas A&M University
Publication date: 25/04/2007

In today's high-performance computing environments, I/O remains the main bottleneck in achieving the performance expected of ever-improving processor and memory technologies. Interconnection networks therefore combine processing units, system I/O, and a high-speed switch fabric into a new paradigm of I/O-based networks. This paradigm decouples the system into computational and I/O interconnections, each allowing "any-to-any" communication among processors and I/O devices, unlike the shared model of a bus architecture. The computational interconnection, a network of processing units (compute-nodes), is used for inter-processor communication in carrying out computation tasks, while the I/O interconnection manages the transfer of I/O requests between the compute-nodes and the I/O or storage media through dedicated I/O processing units (I/O-nodes). Given the special functions performed by the I/O-nodes, their placement and reliability become important issues in improving the overall performance of the interconnection system. This thesis focuses on the design and topological placement of I/O-nodes in torus-based interconnection networks, with the aim of reducing I/O communication latency between compute-nodes and I/O-nodes even in the presence of faulty I/O-nodes. We propose an efficient and scalable relaxed quasi-perfect placement scheme using Lee distance error-correcting codes such that every compute-node is at distance t, or at most distance t+1, from an I/O-node for a given t. This scheme provides a better alternative than quasi-perfect placement when perfect placement cannot be found for a particular torus. Furthermore, when I/O-nodes fail, the placement scheme is also used to determine alternative I/O-nodes for rerouting I/O traffic from affected compute-nodes with minimal slowdown.
To guarantee the quality of service required of inter-processor communication, a scheduling algorithm was developed at the router level to prioritize message forwarding, with inter-process messages given higher priority than I/O messages. Our simulation results show that relaxed quasi-perfect placement outperforms quasi-perfect placement and the conventional I/O placement (where I/O-nodes are concentrated at the base of the torus interconnection) with little degradation in inter-process communication performance. The fault-tolerant redirection scheme also incurs minimal slowdown, especially when the number of faulty I/O-nodes is less than half of the initially available I/O-nodes.
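The distance bound described in this abstract can be checked mechanically. The following is a minimal sketch, not the thesis's placement construction: it only measures, for a given placement, the largest Lee distance from any node to its nearest I/O-node. The function names and the 5x5 example placement are assumptions chosen for illustration.

```python
from itertools import product


def lee_distance(a, b, dims):
    """Lee distance on a torus: per-dimension wraparound distance, summed."""
    return sum(min((x - y) % k, (y - x) % k) for x, y, k in zip(a, b, dims))


def max_distance_to_io(dims, io_nodes):
    """Largest Lee distance from any torus node to its nearest I/O-node."""
    return max(
        min(lee_distance(node, io, dims) for io in io_nodes)
        for node in product(*(range(k) for k in dims))
    )


# Hypothetical example: a 5x5 torus with I/O-nodes at (i, 2i mod 5).
# Every node is then within Lee distance 1 of some I/O-node.
dims = (5, 5)
io_nodes = [(i, (2 * i) % 5) for i in range(5)]
worst_case = max_distance_to_io(dims, io_nodes)
```

A placement satisfies the relaxed bound for a given t when `max_distance_to_io` returns at most t+1.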

Texas A&M Repository

A hierarchically blocked Jacobi SVD algorithm for single and multiple graphics processing units

Author: Novaković Vedran
Publication venue: 'Society for Industrial & Applied Mathematics (SIAM)'
Publication date: 27/09/2014

We present a hierarchically blocked one-sided Jacobi algorithm for the singular value decomposition (SVD), targeting both single and multiple graphics processing units (GPUs). The blocking structure reflects the levels of the GPU's memory hierarchy. The algorithm may outperform MAGMA's dgesvd, while retaining high relative accuracy. To this end, we developed a family of parallel pivot strategies on the GPU's shared address space, but applicable also to inter-GPU communication. Unlike common hybrid approaches, our algorithm in a single-GPU setting needs a CPU for controlling purposes only, while utilizing the GPU's resources to the fullest extent permitted by the hardware. When required by the problem size, the algorithm, in principle, scales to an arbitrary number of GPU nodes. The scalability is demonstrated by a more than twofold speedup for sufficiently large matrices on a Tesla S2050 system with four GPUs vs. a single Fermi card.
Comment: Accepted for publication in SIAM Journal on Scientific Computin
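For readers unfamiliar with the one-sided Jacobi method, a minimal serial sketch follows. It is the plain, unblocked textbook (Hestenes) variant in pure Python with a fixed cyclic pivot order, not the hierarchically blocked GPU algorithm of the paper; the function name, tolerance, and sweep limit are assumptions.

```python
import math


def jacobi_singular_values(a, tol=1e-12, max_sweeps=30):
    """Singular values of a dense matrix (list of rows) by one-sided Jacobi.

    Pairs of columns are rotated until all pairs are numerically orthogonal;
    the singular values are then the column norms.
    """
    cols = [list(col) for col in zip(*a)]  # work column by column
    n = len(cols)
    for _ in range(max_sweeps):
        rotated = False
        for p in range(n - 1):
            for q in range(p + 1, n):
                alpha = sum(x * x for x in cols[p])
                beta = sum(x * x for x in cols[q])
                gamma = sum(x * y for x, y in zip(cols[p], cols[q]))
                if abs(gamma) <= tol * math.sqrt(alpha * beta):
                    continue  # columns already orthogonal enough
                rotated = True
                zeta = (beta - alpha) / (2.0 * gamma)
                t = math.copysign(1.0, zeta) / (abs(zeta) + math.sqrt(1.0 + zeta * zeta))
                c = 1.0 / math.sqrt(1.0 + t * t)
                s = c * t
                # Apply the plane rotation to both columns at once.
                cols[p], cols[q] = (
                    [c * x - s * y for x, y in zip(cols[p], cols[q])],
                    [s * x + c * y for x, y in zip(cols[p], cols[q])],
                )
        if not rotated:
            break
    return sorted((math.sqrt(sum(x * x for x in col)) for col in cols), reverse=True)
```

Because each rotation touches only two columns, disjoint column pairs can be processed independently, which is the property that parallel pivot strategies exploit.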

arXiv.org e-Print Archive

CiteSeerX

Distributed String Sorting Algorithms

Author: Schimek Matthias
Publication venue: Karlsruher Institut für Technologie
Publication date: 01/01/2019

KITopen

Intensive hypercube communication: Prearranged communication in link-bound machines

Author: Baru
Bruce Wagar
Cybenko
Gustafson
Hayes
Ho
Johnsson
Quentin F. Stout
Saad
Valiant
Publication venue: 'Elsevier BV'
Publication date: 01/01/1990

Hypercube algorithms are developed for a variety of communication-intensive tasks such as transposing a matrix, histogramming, sending a (long) message from one node to another, broadcasting a message from one node to all others, broadcasting a message from each node to all others, and exchanging messages between nodes via a fixed permutation. The algorithm for exchanging via a fixed permutation can be viewed as a deterministic analog of Valiant's randomized routing. The algorithms are for link-bound hypercubes in which local processing time is ignored, communication time predominates, message headers are not needed because all nodes know the task being performed, and all nodes can use all communication links simultaneously. Through systematic use of techniques such as pipelining, batching, variable packet sizing, symmetrizing, and completing, algorithms that achieve a time with an optimal highest-order term are obtained for all these problems.
Peer Reviewed. http://deepblue.lib.umich.edu/bitstream/2027.42/28830/1/0000664.pd
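As background for the one-to-all broadcast task mentioned above, here is a minimal sketch of the textbook dimension-by-dimension hypercube broadcast. It is a baseline only: it sends on one link per node per step and does not implement the paper's link-bound techniques (pipelining, batching, and so on). The function name and the node numbering by bit strings are assumptions.

```python
def hypercube_broadcast_schedule(dim, root=0):
    """Return, per step, the (source, destination) sends of a simple broadcast.

    In step j every node that already holds the message forwards it to its
    neighbor across dimension j, so all 2**dim nodes are reached in dim steps.
    """
    have = {root}
    schedule = []
    for j in range(dim):
        sends = [(node, node ^ (1 << j)) for node in sorted(have)]
        have.update(dst for _, dst in sends)
        schedule.append(sends)
    return schedule


# Example: a 3-dimensional hypercube (8 nodes), broadcasting from node 0.
schedule = hypercube_broadcast_schedule(3)
```

Each send crosses exactly one link, since source and destination labels differ in a single bit.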

CiteSeerX

Crossref

Deep Blue Documents at the University of Michigan

Implementation of a fully-balanced periodic tridiagonal solver on a parallel distributed memory architecture

Author: Eidson T. M.
Erlebacher G.

While parallel computers offer significant computational performance, it is generally necessary to evaluate several programming strategies. Two programming strategies for a fairly common problem, a periodic tridiagonal solver, are developed and evaluated. Simple model calculations as well as timing results are presented to evaluate the strategies. The particular tridiagonal solver evaluated is used in many computational fluid dynamics simulation codes. The feature that makes this algorithm unique is that these simulation codes usually require simultaneous solutions for multiple right-hand sides (RHS) of the system of equations. Each RHS solution is independent and can therefore be computed in parallel. As a result, a Gaussian-elimination-type algorithm can be used in a parallel computation, and more complicated approaches such as cyclic reduction are not required. The two strategies are a transpose strategy and a distributed solver strategy. For the transpose strategy, the data is moved so that a subset of all the RHS problems is solved on each of the several processors. This usually requires significant data movement between processor memories across a network. The second strategy attempts to have the algorithm pass data across processor boundaries in a chained manner. This usually requires significantly less data movement. An approach to accomplish this second strategy in a near-perfect load-balanced manner is developed. In addition, an algorithm is shown that directly transforms a sequential Gaussian-elimination-type algorithm into the parallel chained, load-balanced algorithm.
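To make the multiple-RHS point concrete, the following is a minimal serial sketch of a periodic tridiagonal solve in which the elimination coefficients are computed once and reused for every right-hand side. Handling the periodic corners with the Sherman-Morrison formula is an assumption made here for illustration; the abstract does not say how the report treats periodicity, and neither parallel strategy is implemented. All names are hypothetical.

```python
def thomas_multi(a, b, c, rhs_list):
    """Solve a non-periodic tridiagonal system for several right-hand sides.

    a, b, c are the sub-, main, and super-diagonals; a[0] and c[-1] are ignored.
    """
    n = len(b)
    cp = [0.0] * n      # modified super-diagonal, shared by all RHS
    denom = [0.0] * n   # pivots, shared by all RHS
    denom[0] = b[0]
    cp[0] = c[0] / denom[0]
    for i in range(1, n):
        denom[i] = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / denom[i]
    solutions = []
    for d in rhs_list:  # each RHS is independent of the others
        y = [0.0] * n
        y[0] = d[0] / denom[0]
        for i in range(1, n):
            y[i] = (d[i] - a[i] * y[i - 1]) / denom[i]
        for i in range(n - 2, -1, -1):
            y[i] -= cp[i] * y[i + 1]
        solutions.append(y)
    return solutions


def periodic_tridiagonal_solve(a, b, c, rhs_list):
    """Solve a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i] with indices mod n.

    Assumes n >= 3 and b[0] != 0 (Sherman-Morrison corner correction).
    """
    n = len(b)
    gamma = -b[0]
    bb = list(b)
    bb[0] = b[0] - gamma
    bb[-1] = b[-1] - c[-1] * a[0] / gamma
    u = [0.0] * n
    u[0] = gamma
    u[-1] = c[-1]
    # Solve for the correction vector z and all right-hand sides in one pass.
    z, *ys = thomas_multi(a, bb, c, [u] + list(rhs_list))
    ratio = a[0] / gamma
    scale = 1.0 + z[0] + ratio * z[-1]
    return [
        [yi - zi * (y[0] + ratio * y[-1]) / scale for yi, zi in zip(y, z)]
        for y in ys
    ]
```

In a transpose-style parallelization, each processor would run the inner loop of `thomas_multi` over its own subset of `rhs_list`.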

NASA Technical Reports Server

Architectural study of high-speed networks with optical bypassing

Author: Saengudomlert Poompat, 1973-
Publication venue: Massachusetts Institute of Technology
Publication date: 01/01/2002

Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2002. Includes bibliographical references (p. 155-158). This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections.
We study the routing and wavelength assignment (RWA) problem in wavelength division multiplexing (WDM) networks with no wavelength conversion. In a high-speed core network, the traffic can be separated into two components. The first is the aggregated traffic from a large number of small-rate users. Each individual session is not necessarily static, but the combined traffic streams between each pair of access nodes are approximately static. We support this traffic by static provisioning of routes and wavelengths. In particular, we develop several off-line RWA algorithms which use the minimum number of wavelengths to provide $l$ dedicated wavelength paths between each pair of access nodes for basic all-to-all connectivity. The topologies we consider are arbitrary tree, bidirectional ring, two-dimensional torus, and binary hypercube topologies. We observe that wavelength converters do not decrease the wavelength requirement to support this uniform all-to-all traffic. The second traffic component contains traffic sessions from a small number of large-rate users and cannot be well approximated as static due to insufficient aggregation. To support this traffic component, we perform dynamic provisioning of routes and wavelengths. Adopting a nonblocking formulation, we assume that the basic traffic unit is a wavelength, and the traffic matrix changes from time to time but always belongs to a given traffic set. More specifically, let $N$ be the number of access nodes, and let $\mathbf{k}$ denote an integer vector $[k_1, k_2, \ldots, k_N]$.
We define the set of $\mathbf{k}$-allowable traffic matrices to be such that, in each traffic matrix, node $i$, $1 \le i \le N$, can transmit at most $k_i$ wavelengths and receive at most $k_i$ wavelengths. We develop several on-line RWA algorithms which can support all the $\mathbf{k}$-allowable traffic matrices in a rearrangeably nonblocking fashion while using close to the minimum number of wavelengths and incurring few rearrangements of existing lightpaths, if any, for each new session request. The topologies we consider are the same as for static provisioning. We observe that the number of lightpath rearrangements per new session request is proportional to the maximum number of lightpaths supported on a single wavelength. In addition, we observe that the number of lightpath rearrangements depends on the topological properties, e.g. network size, but not on the traffic volume represented by $\mathbf{k}$ as we increase $\mathbf{k}$ by some integer factor. Finally, we begin exploring an RWA problem in which traffic is switched in bands of wavelengths rather than individual wavelengths. We present some preliminary results based on the star topology.
by Poompat Saengudomlert. Ph.D.
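The definition of an allowable traffic matrix translates directly into a row-sum and column-sum check. The sketch below assumes a convention not stated in the abstract: `traffic[i][j]` is the number of wavelengths node i sends to node j, with nodes indexed from 0. The function name is hypothetical.

```python
def is_k_allowable(traffic, k):
    """Check that node i transmits at most k[i] and receives at most k[i] wavelengths.

    traffic[i][j] is assumed to be the number of wavelengths sent from i to j.
    """
    n = len(k)
    if len(traffic) != n or any(len(row) != n for row in traffic):
        return False  # matrix shape does not match the number of access nodes
    for i in range(n):
        sent = sum(traffic[i])                          # row sum
        received = sum(traffic[j][i] for j in range(n))  # column sum
        if sent > k[i] or received > k[i]:
            return False
    return True


# Hypothetical example with three access nodes.
k = [2, 1, 1]
allowed = is_k_allowable([[0, 1, 1], [1, 0, 0], [1, 0, 0]], k)
rejected = is_k_allowable([[0, 1, 1], [0, 0, 1], [1, 0, 0]], k)
```

In the second matrix the last node would receive two wavelengths while its limit is one, so the matrix falls outside the allowable set.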

DSpace@MIT