Algorithmic Based Fault Tolerance Applied to High Performance Computing
We present a new approach to fault tolerance for High Performance Computing
systems. Our approach is based on a careful adaptation of the Algorithm-Based
Fault Tolerance technique (Huang and Abraham, 1984) to the needs of parallel
distributed computation. We obtain a strongly scalable mechanism for fault
tolerance that can also detect and correct errors (bit flips) on the fly during
a computation. To assess the viability of our approach, we have developed a
fault-tolerant matrix-matrix multiplication subroutine, and we propose models
to predict its running time. Our parallel fault-tolerant matrix-matrix
multiplication achieves 1.4 TFLOPS on 484 processors (cluster jacquard.nersc.gov)
and returns a correct result even when a process failure occurs. This
represents 65% of the machine's peak performance and less than 12% overhead with
respect to the fastest failure-free implementation. We predict (and have
observed) that, as the processor count increases, the overhead of the fault
tolerance drops significantly.
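The checksum encoding behind algorithm-based fault tolerance can be sketched in a few lines. This is a minimal single-machine illustration of the Huang-Abraham idea for matrix multiplication, not the paper's parallel implementation; all names are illustrative:

```python
# Sketch of algorithm-based fault tolerance (ABFT) for matrix multiply:
# augment A with a column-checksum row and B with a row-checksum column;
# the product then carries checksums that locate and repair a single
# corrupted data entry. (Errors in the checksums themselves would need
# an extra case not handled in this sketch.)

def matmul(A, B):
    n, m, p = len(A), len(B), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(m)) for j in range(p)]
            for i in range(n)]

def add_column_checksum(A):
    # Append a row holding the sum of each column of A.
    return A + [[sum(col) for col in zip(*A)]]

def add_row_checksum(B):
    # Append to each row of B the sum of its entries.
    return [row + [sum(row)] for row in B]

def detect_and_correct(C):
    # C is (n+1) x (p+1); its last row/column checksum the n x p data block.
    n, p = len(C) - 1, len(C[0]) - 1
    bad_rows = [i for i in range(n) if sum(C[i][:p]) != C[i][p]]
    bad_cols = [j for j in range(p)
                if sum(C[i][j] for i in range(n)) != C[n][j]]
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        i, j = bad_rows[0], bad_cols[0]          # error located
        C[i][j] = C[i][p] - sum(C[i][k] for k in range(p) if k != j)
    return [row[:p] for row in C[:n]]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = matmul(add_column_checksum(A), add_row_checksum(B))
C[0][1] ^= 4                    # inject a single bit-flip into one entry
fixed = detect_and_correct(C)
assert fixed == matmul(A, B)    # the flip is detected and corrected
```

The key property is that the checksum relationship is preserved by the multiplication itself, so no separate recomputation is needed to verify the result.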
Non-locality and Communication Complexity
Quantum information processing is the emerging field that defines and
realizes computing devices that make use of quantum mechanical principles, like
the superposition principle, entanglement, and interference. In this review we
study the information counterpart of computing. The abstract form of the
distributed computing setting is called communication complexity. It studies
the amount of information, in terms of bits or in our case qubits, that two
spatially separated computing devices need to exchange in order to perform some
computational task. Surprisingly, quantum mechanics can be used to obtain
dramatic advantages for such tasks.
We review the area of quantum communication complexity, and show how it
connects the foundational physics questions regarding non-locality with those
of communication complexity studied in theoretical computer science. The first
examples exhibiting the advantage of the use of qubits in distributed
information-processing tasks were based on non-locality tests. However, by now
the field has produced strong and interesting quantum protocols and algorithms
of its own that demonstrate that entanglement, although it cannot be used to
replace communication, can be used to reduce the communication exponentially.
In turn, these new advances yield a new outlook on the foundations of physics,
and could even yield new proposals for experiments that test the foundations of
physics.
Comment: Survey paper, 63 pages, LaTeX. A reformatted version will appear in Reviews of Modern Physics.
Distributed match-making
In many distributed computing environments, processes are concurrently executed by nodes in a store-and-forward communication network. Distributed control issues as diverse as name servers, mutual exclusion, and replicated data management involve making matches between such processes. We propose a formal problem called distributed match-making as the generic paradigm. Algorithms for distributed match-making are developed, and their complexity is investigated in terms of messages and in terms of storage needed. Lower bounds on the complexity of distributed match-making are established. Optimal algorithms, or nearly optimal algorithms, are given for particular network topologies.
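One classic match-making strategy can be sketched on a k x k grid of nodes (the class of schemes for which the abstract's message lower bounds apply; the names and layout below are illustrative, not the paper's notation): a server advertises at every node in its row, a client queries every node in its column, and since any row and column intersect, a match is found with 2k = O(sqrt(n)) messages.

```python
# Grid rendezvous sketch: servers post in their row, clients query
# their column. The intersection node guarantees a match.

class GridNetwork:
    def __init__(self, k):
        self.k = k
        self.boards = {(r, c): {} for r in range(k) for c in range(k)}
        self.messages = 0          # total messages sent so far

    def node_of(self, pid):
        # Illustrative assignment of a process to a home node.
        return (pid // self.k) % self.k, pid % self.k

    def advertise(self, server_pid):
        r, _ = self.node_of(server_pid)
        for c in range(self.k):    # post at every node in the row
            self.boards[(r, c)][server_pid] = True
            self.messages += 1

    def lookup(self, client_pid):
        _, c = self.node_of(client_pid)
        for r in range(self.k):    # query every node in the column
            self.messages += 1
            if self.boards[(r, c)]:
                return next(iter(self.boards[(r, c)]))  # matched server
        return None

net = GridNetwork(4)               # 16 nodes
net.advertise(server_pid=5)        # home node (1, 1): posts along row 1
match = net.lookup(client_pid=14)  # home node (3, 2): scans column 2
assert match == 5
assert net.messages <= 2 * net.k   # O(sqrt(n)) messages per match
```

The grid is only one instantiation; the general problem is to choose post-sets and query-sets so that every pair intersects, which is where the message/storage trade-offs studied in the paper arise.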
Optimizing message-passing performance within symmetric multiprocessor systems
The Message Passing Interface (MPI) has been widely used in the area of parallel computing due to its portability, scalability, and ease of use. Message passing within Symmetric Multiprocessor (SMP) systems is an important part of any MPI library, since it enables parallel programs to run efficiently on SMP systems, or on clusters of SMP systems when combined with other communication methods such as TCP/IP. Most message-passing implementations use a shared memory pool as an intermediate buffer to hold messages, a locking mechanism to protect the pool, and a synchronization mechanism for coordinating the processes. However, performance varies significantly depending on how these are implemented. The work here implements two SMP message-passing modules, using lock-based and lock-free approaches, for MP_Lite, a compact library that implements a subset of the most commonly used MPI functions. Various optimization techniques have been used to optimize the performance. The two modules are evaluated using a communication performance analysis tool called NetPIPE and compared with the implementations of other MPI libraries such as MPICH, MPICH2, LAM/MPI, and MPI/PRO. Performance tools such as PAPI and VTune are used to gather runtime information at the hardware level. This information, together with cache theory and the hardware configuration, is used to explain various performance phenomena. Tests using a real application show the performance of the different implementations in practice. These results all show the improvement of the new techniques over existing implementations.
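The lock-free design the abstract contrasts with the lock-based one rests on a standard observation: a single-producer/single-consumer ring buffer needs no lock, because the read index is written only by the consumer and the write index only by the producer. A minimal sketch of that idea (illustrative Python with spinning threads; a real MPI module would use a shared-memory segment and memory fences):

```python
# Lock-free single-producer/single-consumer ring buffer sketch.
# No lock protects the buffer: each index has exactly one writer.

import threading

class SpscRing:
    def __init__(self, capacity):
        self.buf = [None] * capacity
        self.capacity = capacity
        self.head = 0   # next slot to read  (written only by consumer)
        self.tail = 0   # next slot to write (written only by producer)

    def send(self, msg):                               # producer side
        while (self.tail + 1) % self.capacity == self.head:
            pass                                       # ring full: spin
        self.buf[self.tail] = msg
        self.tail = (self.tail + 1) % self.capacity

    def recv(self):                                    # consumer side
        while self.head == self.tail:
            pass                                       # ring empty: spin
        msg = self.buf[self.head]
        self.head = (self.head + 1) % self.capacity
        return msg

ring = SpscRing(capacity=8)
received = []
consumer = threading.Thread(
    target=lambda: received.extend(ring.recv() for _ in range(100)))
consumer.start()
for i in range(100):
    ring.send(i)                  # producer fills the ring concurrently
consumer.join()
assert received == list(range(100))
```

The trade-off the thesis measures follows directly from this structure: the lock-free path avoids lock acquisition on every message, at the cost of restricting each channel to one sender and one receiver.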
Optimal Collision/Conflict-free Distance-2 Coloring in Synchronous Broadcast/Receive Tree Networks
This article is on message-passing systems where communication is (a)
synchronous and (b) based on the "broadcast/receive" pair of communication
operations. "Synchronous" means that time is discrete and appears as a sequence
of time slots (or rounds) such that each message is received in the very same
round in which it is sent. "Broadcast/receive" means that during a round a
process can either broadcast a message to its neighbors or receive a message
from one of them. In such a communication model, no two neighbors of the same
process, nor a process and any of its neighbors, must be allowed to broadcast
during the same time slot (thereby preventing message collisions in the first
case, and message conflicts in the second case). From a graph theory point of
view, the allocation of slots to processes is known as the distance-2 coloring
problem: a color must be associated with each process (defining the time slots
in which it will be allowed to broadcast) in such a way that any two processes
at distance at most 2 obtain different colors, while the total number of colors
is "as small as possible". The paper presents a parallel message-passing
distance-2 coloring algorithm suited to trees, whose roots are dynamically
defined. This algorithm, which is itself collision-free and conflict-free, uses
Δ+1 colors, where Δ is the maximal degree of the graph (hence the algorithm is
color-optimal). It does not require all processes to have different initial
identities, and its time complexity is O(d), where d is the depth of the tree.
As far as we know, this is the first distributed distance-2 coloring algorithm
designed for the broadcast/receive round-based communication model that has all
the previous properties.
Comment: 19 pages including one appendix. One figure.
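A sequential sketch (not the paper's distributed, collision-free algorithm) shows why Δ+1 colors suffice for distance-2 coloring a tree: processing the tree top-down, the children of a node v must avoid only the colors of v, of v's parent, and of each other, i.e. at most (Δ-1)+2 = Δ+1 colors in total.

```python
# Greedy top-down distance-2 coloring of a rooted tree with at most
# Delta+1 colors (sequential illustration only).

from collections import deque

def distance2_color(children, root):
    color = {root: 0}
    queue = deque([(root, None)])
    while queue:
        v, parent = queue.popleft()
        # Children of v must differ from v, from v's parent, and from
        # each other; everything else is at distance > 2 from them.
        forbidden = {color[v]} | ({color[parent]} if parent is not None else set())
        c = 0
        for child in children.get(v, []):
            while c in forbidden:
                c += 1
            color[child] = c
            c += 1                 # siblings get mutually distinct colors
            queue.append((child, v))
    return color

# A small tree (adjacency by children lists) with maximal degree Delta = 3:
tree = {0: [1, 2, 3], 1: [4, 5], 4: [6]}
col = distance2_color(tree, root=0)

# Verify: any two nodes at distance <= 2 get different colors,
# and at most Delta + 1 = 4 colors are used.
parent = {c: p for p, cs in tree.items() for c in cs}
def neighbors(v):
    return set(tree.get(v, [])) | ({parent[v]} if v in parent else set())
for v in col:
    d2 = set()
    for u in neighbors(v):
        d2 |= {u} | neighbors(u)
    d2.discard(v)
    assert all(col[v] != col[u] for u in d2)
assert len(set(col.values())) <= 4
```

The hard part the paper addresses is doing this without a central scheduler: the coloring messages themselves must already be free of collisions and conflicts.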
Predictable migration and communication in the Quest-V multikernel
Quest-V is a system we have been developing from the ground up, with objectives focusing on safety, predictability and efficiency. It is designed to work on emerging multicore processors with hardware virtualization support. Quest-V is implemented as a "distributed system on a chip" and comprises multiple sandbox kernels. Sandbox kernels are isolated from one another in separate regions of physical memory, having access to a subset of processing cores and I/O devices. This partitioning prevents system failures in one sandbox affecting the operation of other sandboxes. Shared memory channels managed by system monitors enable inter-sandbox communication.
The distributed nature of Quest-V means each sandbox has a separate physical clock, with all event timings being managed by per-core local timers. Each sandbox is responsible for its own scheduling and I/O management, without requiring the intervention of a hypervisor. In this paper, we formulate bounds on inter-sandbox communication in the absence of a global scheduler or global system clock. We also describe how address space migration between sandboxes can be guaranteed without violating service constraints. Experimental results on a working system show the conditions under which Quest-V performs real-time communication and migration.
National Science Foundation (1117025)
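The shape of such a bound can be illustrated with a back-of-the-envelope model (this is an assumption-laden sketch, not Quest-V's actual analysis, and the parameter names are invented): if the sending sandbox flushes a message into the shared-memory channel within T_send of producing it, and the receiving sandbox polls the channel every T_poll and needs C to process an entry, no global clock is needed to bound end-to-end delay.

```python
# Illustrative latency bound for a polled shared-memory channel
# between two independently scheduled sandboxes (hypothetical model).

def worst_case_delay(t_send, t_poll, c_process):
    # Worst case: the message lands just after a poll, so it waits a
    # full polling period before being seen, then is processed.
    return t_send + t_poll + c_process

# Example (microseconds): flush within 100, poll every 500, process in 50.
assert worst_case_delay(100, 500, 50) == 650
```

The point is that each term depends only on one sandbox's local timer, which is why the bound survives the absence of a global scheduler or global system clock.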
A distributed programming environment for Ada
Despite considerable commercial exploitation of fault tolerance systems, significant and difficult research problems remain in such areas as fault detection and correction. A research project is described which constructs a distributed computing test bed for loosely coupled computers. The project is constructing a tool kit to support research into distributed control algorithms, including a distributed Ada compiler, a distributed debugger, test harnesses, and environment monitors. The Ada compiler is being written in Ada and will implement distributed computing at the subsystem level. The design goal is to provide a variety of control mechanisms for distributed programming while retaining total transparency at the code level.
Pervasive Parallel And Distributed Computing In A Liberal Arts College Curriculum
We present a model for incorporating parallel and distributed computing (PDC) throughout an undergraduate CS curriculum. Our curriculum is designed to introduce students early to parallel and distributed computing topics and to expose students to these topics repeatedly in the context of a wide variety of CS courses. The key to our approach is the development of a required intermediate-level course that serves as an introduction to computer systems and parallel computing. It is required of every CS major and minor and is a prerequisite to upper-level courses that expand on parallel and distributed computing topics in different contexts. With the addition of this new course, we are able to easily make room in upper-level courses to add and expand parallel and distributed computing topics. The goal of our curricular design is to ensure that every graduating CS major has exposure to parallel and distributed computing, with both breadth and depth of coverage. Our curriculum is particularly designed for the constraints of a small liberal arts college; however, many of its ideas and its design are applicable to any undergraduate CS curriculum.
CCL: a portable and tunable collective communication library for scalable parallel computers
A collective communication library for parallel computers includes frequently used operations such as broadcast, reduce, scatter, gather, concatenate, synchronize, and shift. Such a library provides users with a convenient programming interface, efficient communication operations, and the advantage of portability. A library of this nature, the Collective Communication Library (CCL), intended for the line of scalable parallel computer products by IBM, has been designed. CCL is part of the parallel application programming interface of the recently announced IBM 9076 Scalable POWERparallel System 1 (SP1). In this paper, we examine several issues related to the functionality, correctness, and performance of a portable collective communication library while focusing on three novel aspects in the design and implementation of CCL: 1) the introduction of process groups, 2) the definition of semantics that ensures correctness, and 3) the design of new and tunable algorithms based on a realistic point-to-point communication model.
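One standard tunable algorithm a collective library chooses among (shown here as a generic illustration, not CCL's actual implementation) is binomial-tree broadcast, which completes in ceil(log2(p)) rounds: in round k, every process that already holds the message forwards it to the process 2^k ranks away.

```python
# Binomial-tree broadcast schedule over p processes rooted at `root`.
# Returns the point-to-point sends as (round, sender, receiver) triples.

import math

def binomial_broadcast_schedule(p, root=0):
    sends = []
    rounds = math.ceil(math.log2(p)) if p > 1 else 0
    for k in range(rounds):
        # Relative ranks < 2^k already hold the message after round k-1.
        for rel in range(min(2 ** k, p)):
            peer = rel + 2 ** k
            if peer < p:
                sends.append((k, (rel + root) % p, (peer + root) % p))
    return sends

sched = binomial_broadcast_schedule(p=6, root=2)
# Every non-root rank receives exactly once, within ceil(log2(6)) = 3 rounds.
received = [dst for _, _, dst in sched]
assert sorted(received) == [0, 1, 3, 4, 5]
assert max(r for r, _, _ in sched) == 2        # rounds 0, 1, 2
```

Making such algorithms "tunable" means switching between schedules like this one and, say, a pipelined chain depending on message size and machine parameters, which is exactly the kind of design decision the point-to-point communication model is meant to inform.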
- âŠ