Parallelized reliability estimation of reconfigurable computer networks
A parallelized system, ASSURE, for computing the reliability of embedded avionics flight control systems which are able to reconfigure themselves in the event of failure is described. ASSURE accepts a grammar that describes a reliability semi-Markov state-space. From this it creates a parallel program that simultaneously generates and analyzes the state-space, placing upper and lower bounds on the probability of system failure. ASSURE is implemented on a 32-node Intel iPSC/860, and has achieved high processor efficiencies on real problems. Through a combination of improved algorithms, exploitation of parallelism, and use of an advanced microprocessor architecture, ASSURE has reduced the execution time on substantial problems by a factor of one thousand over previous workstation implementations. Furthermore, ASSURE's parallel execution rate on the iPSC/860 is an order of magnitude faster than its serial execution rate on a Cray-2 supercomputer. While dynamic load balancing is necessary for ASSURE's good performance, it is needed only infrequently; the particular method of load balancing used does not substantially affect performance.
Reliable low latency I/O in torus-based interconnection networks
In today's high performance computing environment I/O remains the main bottleneck in
achieving the optimal performance expected of the ever improving processor and
memory technologies. Interconnection networks therefore combine processing units,
system I/O and high speed switch network fabric into a new paradigm of I/O based
network. It decouples the system into computational and I/O interconnections each
allowing "any-to-any" communications among processors and I/O devices unlike the
shared model in bus architecture. The computational interconnection, a network of
processing units (compute-nodes), is used for inter-processor communication in carrying
out computation tasks, while the I/O interconnection manages the transfer of I/O requests
between the compute-nodes and the I/O or storage media through some dedicated I/O
processing units (I/O-nodes). Considering the special functions performed by the I/O
nodes, their placement and reliability become important issues in improving the overall
performance of the interconnection system.
This thesis focuses on design and topological placement of I/O-nodes in torus based
interconnection networks, with the aim of reducing I/O communication latency between
compute-nodes and I/O-nodes even in the presence of faulty I/O-nodes. We propose an
efficient and scalable relaxed quasi-perfect placement scheme using Lee distance error
correction code such that compute-nodes are at distance-t or at most distance-t+1 from an
I/O-node for a given t. This scheme provides a better and optimal alternative placement
than quasi perfect placement when perfect placement cannot be found for a particular
torus. Furthermore, in the occurrence of faulty I/O-nodes, the placement scheme is also
used in determining other alternative I/O-nodes for rerouting I/O traffic from affected
compute-nodes with minimal slowdown. In order to guarantee the quality of service
required of inter-processor communication, a scheduling algorithm was developed at the router level to prioritize message forwarding according to inter-process and I/O messages
with the former given higher priority.
Our simulation results show that relaxed quasi-perfect outperforms quasi-perfect and the
conventional I/O placement (where I/O nodes are concentrated at the base of the torus
interconnection) with little degradation in inter-process communication performance.
Also the fault tolerant redirection scheme provides a minimal slowdown, especially when
the number of faulty I/O nodes is less than half of the initially available I/O nodes.
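The distance bound in the abstract above can be illustrated with a small sketch. It is not the thesis's placement scheme: the 5x5 torus, the textbook perfect Lee code `{(i, 2i mod 5)}`, and the function names are all assumptions chosen for illustration. The code measures how far the worst-placed compute-node is from its nearest I/O-node, and picks an alternative I/O-node when one is marked faulty.

```python
from itertools import product

def lee_distance(a, b, dims):
    """Lee distance on a torus: wraparound distance per dimension, summed."""
    return sum(min((x - y) % k, (y - x) % k) for x, y, k in zip(a, b, dims))

def coverage_radius(io_nodes, dims):
    """Largest distance from any torus node to its nearest I/O-node."""
    nodes = product(*(range(k) for k in dims))
    return max(min(lee_distance(n, io, dims) for io in io_nodes) for n in nodes)

def nearest_io(node, io_nodes, dims, faulty=()):
    """Closest non-faulty I/O-node (hypothetical rerouting rule, ties by list order)."""
    alive = [io for io in io_nodes if io not in faulty]
    return min(alive, key=lambda io: lee_distance(node, io, dims))

dims = (5, 5)
# Assumed example: a perfect distance-1 placement on the 5x5 torus.
perfect = [(i, (2 * i) % 5) for i in range(5)]
# Conventional placement: all I/O-nodes concentrated on one row.
base_row = [(0, j) for j in range(5)]

print(coverage_radius(perfect, dims))   # worst-case hops with spread placement
print(coverage_radius(base_row, dims))  # worst-case hops with base-row placement
print(nearest_io((1, 1), perfect, dims, faulty={(1, 2)}))
```

Spreading the same number of I/O-nodes lowers the worst-case hop count compared with the base-row layout, and a compute-node whose nearest I/O-node fails is redirected to the next closest one at the cost of one extra hop in this example.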
A hierarchically blocked Jacobi SVD algorithm for single and multiple graphics processing units
We present a hierarchically blocked one-sided Jacobi algorithm for the
singular value decomposition (SVD), targeting both single and multiple graphics
processing units (GPUs). The blocking structure reflects the levels of the GPU's
memory hierarchy. The algorithm may outperform MAGMA's dgesvd, while retaining
high relative accuracy. To this end, we developed a family of parallel pivot
strategies on GPU's shared address space, but applicable also to inter-GPU
communication. Unlike common hybrid approaches, our algorithm in a single GPU
setting needs a CPU for the controlling purposes only, while utilizing GPU's
resources to the fullest extent permitted by the hardware. When required by the
problem size, the algorithm, in principle, scales to an arbitrary number of GPU
nodes. The scalability is demonstrated by more than twofold speedup for
sufficiently large matrices on a Tesla S2050 system with four GPUs vs. a single
Fermi card. Comment: Accepted for publication in SIAM Journal on Scientific Computing
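For readers unfamiliar with the underlying method, the following is a minimal sketch of the plain, unblocked one-sided (Hestenes) Jacobi iteration on the CPU. It is not the hierarchically blocked GPU algorithm of the paper: the cyclic pivot order, the tolerance, and the function name are assumptions, and only singular values are returned.

```python
import math

def one_sided_jacobi_singular_values(a, tol=1e-12, max_sweeps=50):
    """Singular values of a (list of rows), descending, via one-sided Jacobi."""
    m, n = len(a), len(a[0])
    # Work on columns: rotations are applied to pairs of columns.
    cols = [[a[i][j] for i in range(m)] for j in range(n)]
    for _ in range(max_sweeps):
        rotated = False
        for p in range(n - 1):            # cyclic pivot order (an assumption)
            for q in range(p + 1, n):
                alpha = sum(x * x for x in cols[p])
                beta = sum(x * x for x in cols[q])
                gamma = sum(x * y for x, y in zip(cols[p], cols[q]))
                # Skip pairs that are already numerically orthogonal.
                if abs(gamma) <= tol * math.sqrt(alpha * beta):
                    continue
                rotated = True
                zeta = (beta - alpha) / (2.0 * gamma)
                t = math.copysign(1.0, zeta) / (abs(zeta) + math.sqrt(1.0 + zeta * zeta))
                c = 1.0 / math.sqrt(1.0 + t * t)
                s = c * t
                # Plane rotation that makes columns p and q orthogonal.
                cols[p], cols[q] = (
                    [c * x - s * y for x, y in zip(cols[p], cols[q])],
                    [s * x + c * y for x, y in zip(cols[p], cols[q])],
                )
        if not rotated:
            break
    # After convergence the column norms are the singular values.
    return sorted((math.sqrt(sum(x * x for x in col)) for col in cols), reverse=True)

sv = one_sided_jacobi_singular_values([[3.0, 0.0], [4.0, 5.0]])
print(sv)
```

Because each rotation touches only two columns, disjoint column pairs can be rotated concurrently, which is the property that parallel pivot strategies exploit.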
Intensive hypercube communication: Prearranged communication in link-bound machines
Hypercube algorithms are developed for a variety of communication-intensive tasks such as transposing a matrix, histogramming, sending a (long) message from one node to another, broadcasting a message from one node to all others, broadcasting a message from each node to all others, and exchanging messages between nodes via a fixed permutation. The algorithm for exchanging via a fixed permutation can be viewed as a deterministic analog of Valiant's randomized routing. The algorithms are for link-bound hypercubes in which local processing time is ignored, communication time predominates, message headers are not needed because all nodes know the task being performed, and all nodes can use all communication links simultaneously. Through systematic use of techniques such as pipelining, batching, variable packet sizing, symmetrizing, and completing, algorithms are obtained for all these problems that achieve a time with an optimal highest-order term. Peer Reviewed. http://deepblue.lib.umich.edu/bitstream/2027.42/28830/1/0000664.pd
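As a point of reference for the one-to-all broadcast task, here is a minimal sketch of the basic binomial-tree broadcast on a hypercube. It is the simple baseline, not the pipelined algorithms of the paper, which use all links simultaneously; the function name and the returned schedule format are assumptions.

```python
def hypercube_broadcast(dim, root=0):
    """Simulate a one-to-all broadcast on a dim-dimensional hypercube.

    In step `bit`, every node that already holds the message forwards it
    to its neighbor across dimension `bit`. Returns the set of informed
    nodes and the per-step list of (sender, receiver) pairs.
    """
    have = {root}
    schedule = []
    for bit in range(dim):
        sends = [(node, node ^ (1 << bit)) for node in sorted(have)]
        have.update(dst for _, dst in sends)
        schedule.append(sends)
    return have, schedule

informed, steps = hypercube_broadcast(3, root=5)
for i, sends in enumerate(steps):
    print(f"step {i}: {sends}")
```

The number of informed nodes doubles at every step, so all 2^dim nodes are reached in dim steps, but each step uses only one link per node.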
Implementation of a fully-balanced periodic tridiagonal solver on a parallel distributed memory architecture
While parallel computers offer significant computational performance, it is generally necessary to evaluate several programming strategies. Two programming strategies for a fairly common problem - a periodic tridiagonal solver - are developed and evaluated. Simple model calculations as well as timing results are presented to evaluate the various strategies. The particular tridiagonal solver evaluated is used in many computational fluid dynamic simulation codes. The feature that makes this algorithm unique is that these simulation codes usually require simultaneous solutions for multiple right-hand-sides (RHS) of the system of equations. Each RHS solution is independent and thus can be computed in parallel. Thus a Gaussian elimination type algorithm can be used in a parallel computation and the more complicated approaches such as cyclic reduction are not required. The two strategies are a transpose strategy and a distributed solver strategy. For the transpose strategy, the data is moved so that a subset of all the RHS problems is solved on each of the several processors. This usually requires significant data movement between processor memories across a network. The second strategy attempts to have the algorithm allow the data to flow across processor boundaries in a chained manner. This usually requires significantly less data movement. An approach to accomplish this second strategy in a near-perfect load-balanced manner is developed. In addition, an algorithm will be shown to directly transform a sequential Gaussian elimination type algorithm into the parallel chained, load-balanced algorithm.
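The independence of the right-hand sides can be seen in a small serial sketch. It is not either of the paper's two strategies: it assumes a Gaussian-elimination (Thomas) solve combined with the Sherman-Morrison correction for the periodic corner entries, requires at least three unknowns, and uses hypothetical function names. Row `i` of the system is `a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i]` with cyclic indices.

```python
def thomas(a, b, c, d):
    """Solve a non-periodic tridiagonal system (a[0] and c[-1] are ignored)."""
    n = len(b)
    cp, dp = [0.0] * n, [0.0] * n
    cp[0], dp[0] = c[0] / b[0], d[0] / b[0]
    for i in range(1, n):                      # forward elimination
        denom = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / denom
        dp[i] = (d[i] - a[i] * dp[i - 1]) / denom
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):             # back substitution
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x

def solve_periodic(a, b, c, rhs_list):
    """Solve the periodic system for every right-hand side in rhs_list."""
    n = len(b)
    gamma = -b[0]
    bb = list(b)                               # modified diagonal
    bb[0] = b[0] - gamma
    bb[-1] = b[-1] - a[0] * c[-1] / gamma
    u = [0.0] * n
    u[0], u[-1] = gamma, c[-1]
    q = thomas(a, bb, c, u)                    # shared by all right-hand sides
    w = a[0] / gamma
    denom = 1.0 + q[0] + w * q[-1]
    solutions = []
    for d in rhs_list:                         # each pass is independent of the others
        y = thomas(a, bb, c, d)
        factor = (y[0] + w * y[-1]) / denom
        solutions.append([yi - factor * qi for yi, qi in zip(y, q)])
    return solutions

a = [-1.0] * 4
b = [4.0] * 4
c = [-1.0] * 4
xs = solve_periodic(a, b, c, [[2.0, 2.0, 2.0, 2.0], [-2.0, 4.0, 6.0, 12.0]])
print(xs)
```

The loop over `rhs_list` carries no dependence between iterations, so in a parallel setting the right-hand sides could be divided among processors, which is the observation the transpose strategy builds on.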
Architectural study of high-speed networks with optical bypassing
Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2002. Includes bibliographical references (p. 155-158). This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections. We study the routing and wavelength assignment (RWA) problem in wavelength division multiplexing (WDM) networks with no wavelength conversion. In a high-speed core network, the traffic can be separated into two components. The first is the aggregated traffic from a large number of small-rate users. Each individual session is not necessarily static but the combined traffic streams between each pair of access nodes are approximately static. We support this traffic by static provisioning of routes and wavelengths. In particular, we develop several off-line RWA algorithms which use the minimum number of wavelengths to provide I dedicated wavelength paths between each pair of access nodes for basic all-to-all connectivity. The topologies we consider are arbitrary tree, bidirectional ring, two-dimensional torus, and binary hypercube topologies. We observe that wavelength converters do not decrease the wavelength requirement to support this uniform all-to-all traffic. The second traffic component contains traffic sessions from a small number of large-rate users and cannot be well approximated as static due to insufficient aggregation. To support this traffic component, we perform dynamic provisioning of routes and wavelengths. Adopting a nonblocking formulation, we assume that the basic traffic unit is a wavelength, and the traffic matrix changes from time to time but always belongs to a given traffic set. More specifically, let N be the number of access nodes, and k denote an integer vector [k1, k2, ..., kN].
We define the set of k-allowable traffic matrices to be such that, in each traffic matrix, node i, 1 <= i <= N, can transmit at most ki wavelengths and receive at most ki wavelengths. We develop several on-line RWA algorithms which can support all the k-allowable traffic matrices in a rearrangeably nonblocking fashion while using close to the minimum number of wavelengths and incurring few rearrangements of existing lightpaths, if any, for each new session request. The topologies we consider are the same as for static provisioning. We observe that the number of lightpath rearrangements per new session request is proportional to the maximum number of lightpaths supported on a single wavelength. In addition, we observe that the number of lightpath rearrangements depends on the topological properties, e.g. network size, but not on the traffic volume represented by k as we increase k by some integer factor. Finally, we begin exploring an RWA problem in which traffic is switched in bands of wavelengths rather than individual wavelengths. We present some preliminary results based on the star topology. by Poompat Saengudomlert. Ph.D.
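The definition of a k-allowable traffic matrix can be restated as a short check. This is a sketch of the definition only, not of any RWA algorithm from the thesis; the matrix layout (`traffic[i][j]` is the number of wavelengths node i sends to node j, with a zero diagonal) and the function name are assumptions.

```python
def is_k_allowable(traffic, k):
    """True if every node i sends at most k[i] and receives at most k[i] wavelengths."""
    n = len(k)
    for i in range(n):
        sent = sum(traffic[i])                            # row sum: transmitted
        received = sum(traffic[j][i] for j in range(n))   # column sum: received
        if sent > k[i] or received > k[i]:
            return False
    return True

k = [2, 1, 2]
ok = [[0, 1, 1],
      [1, 0, 0],
      [1, 0, 0]]
too_much = [[0, 1, 1],
            [1, 0, 1],   # node 1 transmits 2 wavelengths but k[1] = 1
            [1, 0, 0]]
print(is_k_allowable(ok, k), is_k_allowable(too_much, k))
```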