Redundancy management for efficient fault recovery in NASA's distributed computing system
The management of redundancy in computer systems was studied and guidelines were provided for the development of NASA's fault-tolerant distributed systems. Fault recovery and reconfiguration mechanisms were examined. A theoretical foundation was laid for redundancy management by efficient reconfiguration methods and algorithmic diversity. Algorithms were developed to optimize the resources for embedding of computational graphs of tasks in the system architecture and reconfiguration of these tasks after a failure has occurred. The computational structure represented by a path and the complete binary tree was considered and the mesh and hypercube architectures were targeted for their embeddings. The innovative concept of Hybrid Algorithm Technique was introduced. This new technique provides a mechanism for obtaining fault tolerance while exhibiting improved performance
Integration of tools for the Design and Assessment of High-Performance, Highly Reliable Computing Systems (DAHPHRS), phase 1
Systems for Space Defense Initiative (SDI) space applications typically require both high performance and very high reliability. These requirements present the systems engineer evaluating such systems with the extremely difficult problem of conducting performance and reliability trade-offs over large design spaces. A controlled development process supported by appropriate automated tools must be used to assure that the system will meet design objectives. This report describes an investigation of methods, tools, and techniques necessary to support performance and reliability modeling for SDI systems development. Models of the JPL Hypercubes, the Encore Multimax, and the C.S. Draper Lab Fault-Tolerant Parallel Processor (FTPP) parallel-computing architectures using candidate SDI weapons-to-target assignment algorithms as workloads were built and analyzed as a means of identifying the necessary system models, how the models interact, and what experiments and analyses should be performed. As a result of this effort, weaknesses in the existing methods and tools were revealed and capabilities that will be required for both individual tools and an integrated toolset were identified
On resource placements and fault-tolerant broadcasting in toroidal networks
Parallel computers are classified into multiprocessors and multicomputers. A multiprocessor system usually has a shared memory through which its processors can communicate. On the other hand, the processors of a multicomputer system communicate by message passing through an interconnection network. Toroidal networks are a widely used class of interconnection networks. Compared to a hypercube, a torus has a larger diameter but better tradeoffs, such as higher channel bandwidth and lower node degree. Results on resource placements and fault-tolerant broadcasting in toroidal networks are presented. Given a limited number of resources, it is desirable to distribute these resources over the interconnection network so that the distance between a non-resource and a closest resource is minimized. This problem is known as distance-d placement. In such a placement, each non-resource must be within a distance of d or less from at least one resource, where the number of resources used is the least possible. Solutions for distance-d placements in 2D and 3D tori are proposed. These solutions are compared with placements used so far in practice. Simulation experiments show that the proposed solutions are superior to the placements used in practice in terms of reducing average network latency. The complexity of a multicomputer increases the chances of having processor failures. Therefore, designing fault-tolerant communication algorithms is necessary for sufficient utilization of such a system. Broadcasting (single-node one-to-all) in a multicomputer is one of the important communication primitives. A non-redundant fault-tolerant broadcasting algorithm in a faulty toroidal network is designed. The algorithm can tolerate up to (2n-2) processor failures. Compared to the optimal algorithm in a fault-free n-dimensional toroidal network, the proposed algorithm requires at most 3 extra communication steps using cut-through packet routing, and (n + 1) extra steps using store-and-forward routing
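The distance-d placement idea can be illustrated with a small sketch. The example below is not taken from the thesis: the function names are hypothetical, and the placement rule x + 2y ≡ 0 (mod 5) is a standard textbook perfect distance-1 pattern for a 5x5 torus, used here only as an assumed example. It measures distance as the Lee (wraparound) distance and reports how far the worst-off node is from its closest resource.

```python
from itertools import product

def lee_distance(a, b, k):
    """Shortest hop count between nodes a and b of a k-ary torus (wraparound links)."""
    return sum(min((x - y) % k, (y - x) % k) for x, y in zip(a, b))

def placement_radius(resources, k, n=2):
    """Largest distance from any node of the k-ary n-dimensional torus to its closest resource."""
    return max(
        min(lee_distance(node, r, k) for r in resources)
        for node in product(range(k), repeat=n)
    )

# Assumed example: 5x5 torus with resources at nodes where x + 2y is a multiple of 5.
k = 5
resources = [(x, y) for x, y in product(range(k), repeat=2) if (x + 2 * y) % k == 0]
radius = placement_radius(resources, k)
print(len(resources), "resources, worst-case distance", radius)
```

In this assumed pattern, 5 resources cover all 25 nodes at distance 1, and since each resource can cover at most 5 nodes (itself and its 4 neighbors), no smaller set would do for d = 1.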
Resource placement, data rearrangement, and Hamiltonian cycles in torus networks
Many parallel machines, both commercial and experimental, have been or are being designed with toroidal interconnection networks. For a given number of nodes, the torus has a relatively larger diameter, but better cost/performance tradeoffs, such as higher channel bandwidth and lower node degree, when compared to the hypercube. Thus, the torus is becoming a popular topology for the interconnection network of high-performance parallel computers.
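The diameter and node-degree comparison can be made concrete with the standard closed-form values for these topologies. The sketch below is illustrative only: the function names and the choice of 4096 nodes are assumptions, not figures from the thesis.

```python
def hypercube_metrics(dim):
    """Node count, node degree, and diameter of a binary hypercube of dimension dim."""
    return {"nodes": 2 ** dim, "degree": dim, "diameter": dim}

def torus_metrics(k, n):
    """Same metrics for a k-ary n-cube torus (assumes k >= 3 so both ring neighbors are distinct)."""
    return {"nodes": k ** n, "degree": 2 * n, "diameter": n * (k // 2)}

# Three assumed configurations, each with 4096 nodes.
configs = [
    ("hypercube (dim 12)", hypercube_metrics(12)),
    ("2D torus 64x64", torus_metrics(64, 2)),
    ("3D torus 16x16x16", torus_metrics(16, 3)),
]
for name, metrics in configs:
    print(f"{name}: {metrics}")
```

For the same node count, the tori need 4 or 6 links per node instead of 12, at the price of a diameter of 64 or 24 hops instead of 12.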
In a multicomputer, the resources, such as I/O devices or software packages, are distributed over the network. The first part of the thesis investigates efficient methods of distributing resources in a torus network. Three classes of placement methods are studied. They are (1) the distance-t placement problem, in which any non-resource node is at a distance of at most t from some resource node; (2) the j-adjacency problem, in which a non-resource node is adjacent to at least j resource nodes; and (3) the generalized placement problem, in which a non-resource node must be at a distance of at most t from at least j resource nodes.
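The three problem classes differ only in the parameters t and j, which a single checker can make explicit. This is a minimal sketch under assumed names (`torus_distance`, `is_generalized_placement`), and the checkerboard pattern is a hypothetical example rather than a placement from the thesis.

```python
from itertools import product

def torus_distance(a, b, dims):
    """Wraparound hop distance between nodes a and b; dims gives the size of each dimension."""
    return sum(min((x - y) % k, (y - x) % k) for x, y, k in zip(a, b, dims))

def is_generalized_placement(resources, dims, t, j):
    """True if every non-resource node has at least j resource nodes within distance t."""
    resource_set = set(resources)
    for node in product(*(range(k) for k in dims)):
        if node in resource_set:
            continue
        nearby = sum(1 for r in resource_set if torus_distance(node, r, dims) <= t)
        if nearby < j:
            return False
    return True

# Assumed example: checkerboard pattern on a 4x4 torus.
dims = (4, 4)
checkerboard = [(x, y) for x, y in product(range(4), repeat=2) if (x + y) % 2 == 0]
print(is_generalized_placement(checkerboard, dims, t=1, j=1))  # distance-1 placement
print(is_generalized_placement(checkerboard, dims, t=1, j=4))  # 4-adjacency
```

Setting j = 1 gives the distance-t check, setting t = 1 gives the j-adjacency check, and other combinations give the generalized problem. The checker only tests validity; it does not say whether the number of resources is minimal.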
This resource placement technique can be applied to allocating spare processors to provide fault tolerance in the case of processor failures. Some efficient spare processor placement methods and reconfiguration schemes for processor failures are also described.
In a torus based parallel system, some algorithms give best performance if the data are distributed to processors numbered in Cartesian order; in some other cases, it is better to distribute the data to processors numbered in Gray code order. Since the placement patterns may be changed dynamically, it is essential to find efficient methods of rearranging the data from Gray code order to Cartesian order and vice versa. In the second part of the thesis, some efficient methods for data transfer from Cartesian order to radix order and vice versa are developed.
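The relation between the two numberings can be sketched with the binary reflected Gray code. This is an assumption made for illustration: the abstract does not say which Gray code the thesis uses (a torus would more naturally call for a k-ary variant), and the function names are hypothetical.

```python
def cartesian_to_gray(i):
    """Binary reflected Gray code of index i."""
    return i ^ (i >> 1)

def gray_to_cartesian(g):
    """Inverse mapping: recover the plain index from its Gray code."""
    i = 0
    while g:
        i ^= g
        g >>= 1
    return i

def rearrange_to_gray(data):
    """Move the item held by processor i to processor cartesian_to_gray(i).

    Assumes len(data) is a power of two so the mapping is a permutation.
    """
    out = [None] * len(data)
    for i, item in enumerate(data):
        out[cartesian_to_gray(i)] = item
    return out

print([cartesian_to_gray(i) for i in range(8)])  # [0, 1, 3, 2, 6, 7, 5, 4]
print(rearrange_to_gray(list("abcdefgh")))
```

Consecutive Gray labels differ in exactly one bit, which is why data laid out this way keeps logically adjacent items on neighboring processors. The sketch shows only the permutation itself, not an efficient communication schedule for carrying it out.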
The last part of the thesis gives results on generating edge disjoint Hamiltonian cycles in k-ary n-cubes, hypercubes, and 2D tori. These edge disjoint cycles are quite useful for many communication algorithms
Reliable low latency I/O in torus-based interconnection networks
In today's high performance computing environment I/O remains the main bottleneck in
achieving the optimal performance expected of the ever improving processor and
memory technologies. Interconnection networks therefore combine processing units,
system I/O and high speed switch network fabric into a new paradigm of I/O based
network. It decouples the system into computational and I/O interconnections each
allowing "any-to-any" communications among processors and I/O devices unlike the
shared model in bus architecture. The computational interconnection, a network of
processing units (compute-nodes), is used for inter-processor communication in carrying
out computation tasks, while the I/O interconnection manages the transfer of I/O requests
between the compute-nodes and the I/O or storage media through some dedicated I/O
processing units (I/O-nodes). Considering the special functions performed by the I/O
nodes, their placement and reliability become important issues in improving the overall
performance of the interconnection system.
This thesis focuses on design and topological placement of I/O-nodes in torus based
interconnection networks, with the aim of reducing I/O communication latency between
compute-nodes and I/O-nodes even in the presence of faulty I/O-nodes. We propose an
efficient and scalable relaxed quasi-perfect placement scheme using Lee distance error
correction code such that compute-nodes are at distance-t or at most distance-t+1 from an
I/O-node for a given t. This scheme provides a better and optimal alternative placement
than quasi perfect placement when perfect placement cannot be found for a particular
torus. Furthermore, in the occurrence of faulty I/O-nodes, the placement scheme is also
used in determining other alternative I/O-nodes for rerouting I/O traffic from affected
compute-nodes with minimal slowdown. In order to guarantee the quality of service
required of inter-processor communication, a scheduling algorithm was developed at the router level to prioritize message forwarding according to inter-process and I/O messages
with the former given higher priority.
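The router-level prioritization described above can be sketched as a two-class priority queue. This is a minimal illustration, not the thesis's scheduling algorithm: the class name, the constants, and the first-in-first-out tie-breaking within a class are all assumptions.

```python
import heapq
from itertools import count

# Assumed encoding: a lower value means a higher forwarding priority.
INTER_PROCESS, IO = 0, 1

class RouterQueue:
    """Forwards inter-process messages before I/O messages, FIFO within each class."""

    def __init__(self):
        self._heap = []
        self._seq = count()  # arrival counter used to keep FIFO order within a class

    def enqueue(self, kind, payload):
        heapq.heappush(self._heap, (kind, next(self._seq), payload))

    def dequeue(self):
        kind, _, payload = heapq.heappop(self._heap)
        return kind, payload

    def __len__(self):
        return len(self._heap)

queue = RouterQueue()
queue.enqueue(IO, "io-1")
queue.enqueue(INTER_PROCESS, "ip-1")
queue.enqueue(IO, "io-2")
queue.enqueue(INTER_PROCESS, "ip-2")
order = [queue.dequeue()[1] for _ in range(len(queue))]
print(order)  # ['ip-1', 'ip-2', 'io-1', 'io-2']
```

A strict priority rule like this one can starve I/O traffic under heavy inter-process load; whether and how the thesis guards against that is not stated in the abstract.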
Our simulation results show that relaxed quasi-perfect placement outperforms quasi-perfect placement and the
conventional I/O placement (where I/O nodes are concentrated at the base of the torus
interconnection) with little degradation in inter-process communication performance.
Also the fault tolerant redirection scheme provides a minimal slowdown, especially when
the number of faulty I/O nodes is less than half of the initial available I/O nodes
Alpha Entanglement Codes: Practical Erasure Codes to Archive Data in Unreliable Environments
Data centres that use consumer-grade disk drives and distributed
peer-to-peer systems are unreliable environments to archive data without enough
redundancy. Most redundancy schemes are not completely effective for providing
high availability, durability and integrity in the long-term. We propose alpha
entanglement codes, a mechanism that creates a virtual layer of highly
interconnected storage devices to propagate redundant information across a
large scale storage system. Our motivation is to design flexible and practical
erasure codes with high fault-tolerance to improve data durability and
availability even in catastrophic scenarios. By flexible and practical, we mean
code settings that can be adapted to future requirements and practical
implementations with reasonable trade-offs between security, resource usage and
performance. The codes have three parameters. Alpha increases storage overhead
linearly but increases the possible paths to recover data exponentially. Two
other parameters increase fault-tolerance even further without the need of
additional storage. As a result, an entangled storage system can provide high
availability, durability and offer additional integrity: it is more difficult
to modify data undetectably. We evaluate how several redundancy schemes perform
in unreliable environments and show that alpha entanglement codes are flexible
and practical codes. Remarkably, they excel at code locality, hence, they
reduce repair costs and become less dependent on storage locations with poor
availability. Our solution outperforms Reed-Solomon codes in many disaster
recovery scenarios.
Comment: The publication has 12 pages and 13 figures. This work was partially
supported by Swiss National Science Foundation SNSF Doc.Mobility 162014, 2018
48th Annual IEEE/IFIP International Conference on Dependable Systems and
Networks (DSN
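The basic idea of entangling data with earlier parities can be sketched with a single XOR chain. This is a deliberately simplified illustration and not the paper's construction: it uses one strand only, says nothing about the three code parameters, and all names are hypothetical. Each new parity is assumed to be the XOR of the incoming data block with the previous parity on the strand.

```python
def xor_blocks(a, b):
    """Bytewise XOR of two equal-length blocks."""
    return bytes(x ^ y for x, y in zip(a, b))

def entangle(data_blocks, block_size):
    """Build one strand of parities: parity[i] = data[i] XOR parity[i-1]."""
    parities = []
    prev = bytes(block_size)  # assumed all-zero block at the start of the strand
    for block in data_blocks:
        prev = xor_blocks(block, prev)
        parities.append(prev)
    return parities

def repair_data(i, parities, block_size):
    """Recover data block i from the two parities adjacent to it on the strand."""
    left = parities[i - 1] if i > 0 else bytes(block_size)
    return xor_blocks(left, parities[i])

data = [b"abcd", b"efgh", b"ijkl"]
parities = entangle(data, 4)
print(repair_data(1, parities, 4))  # b'efgh'
```

In this simplified chain a lost data block is rebuilt from just two neighboring parities, which hints at the locality property mentioned in the abstract; routing each data block through several strands would give more than one such repair path.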
Center for Space Microelectronics Technology 1988-1989 technical report
The 1988 to 1989 Technical Report of the JPL Center for Space Microelectronics Technology summarizes the technical accomplishments, publications, presentations, and patents of the center. Listed are 321 publications, 282 presentations, and 140 new technology reports and patents