Self-Healing Protocols for Connectivity Maintenance in Unstructured Overlays
In this paper, we discuss the use of self-organizing protocols to improve
the reliability of dynamic Peer-to-Peer (P2P) overlay networks. Two similar
approaches are studied, which are based on local knowledge of the nodes' 2nd
neighborhood. The first scheme is a simple protocol requiring interactions
among nodes and their direct neighbors. The second scheme adds a check on the
Edge Clustering Coefficient (ECC), a local measure that allows determining
edges connecting different clusters in the network. The performed simulation
assessment evaluates these protocols over uniform networks, clustered networks
and scale-free networks. Different failure modes are considered. Results
demonstrate the effectiveness of the proposal. Comment: The paper has been accepted to the journal Peer-to-Peer Networking
and Applications. The final publication is available at Springer via
http://dx.doi.org/10.1007/s12083-015-0384-
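The Edge Clustering Coefficient mentioned in this abstract can be illustrated with a minimal sketch. The abstract does not give the formula, so the definition below is an assumption taken from the common community-detection form: triangles through the edge plus one, divided by the maximum number of triangles the edge could belong to. The function name, the adjacency-dictionary layout, and the choice to return infinity when no triangle is possible are all illustrative, not taken from the paper.

```python
def edge_clustering_coefficient(adj, u, v):
    """ECC of edge (u, v) in an undirected graph.

    adj maps each node to the set of its neighbours. Assumed definition:
    (triangles through the edge + 1) / (min(deg(u), deg(v)) - 1).
    """
    if v not in adj[u]:
        raise ValueError(f"no edge between {u!r} and {v!r}")
    triangles = len(adj[u] & adj[v])  # each common neighbour closes a triangle
    possible = min(len(adj[u]), len(adj[v])) - 1
    if possible == 0:
        # One endpoint has degree 1, so no triangle can exist (assumed convention).
        return float("inf")
    return (triangles + 1) / possible


# Two triangles (a, b, c) and (d, e, f) joined by the single edge c-d.
graph = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
    "d": {"c", "e", "f"}, "e": {"d", "f"}, "f": {"d", "e"},
}
intra = edge_clustering_coefficient(graph, "a", "b")  # edge inside a cluster
inter = edge_clustering_coefficient(graph, "c", "d")  # edge between clusters
print(intra, inter)
```

On this toy graph the edge inside a triangle scores 2.0 and the bridge scores 0.5, so a low value flags an edge that connects different clusters. Computing it needs only the neighbour sets of the two endpoints, which matches the 2nd-neighborhood knowledge the abstract refers to.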
Failure Detectors for Wireless Sensor-Actuator Systems
Wireless sensor-actuator systems (WSAS) offer exciting opportunities for emerging applications by facilitating fine-grained monitoring and control, and dense instrumentation. The large scale of such systems increases the need for them to tolerate and cope with failures in a localized and decentralized manner. We present abstractions for detecting node failures and link failures caused by topology changes in a WSAS. These abstractions were designed and implemented as a set of reusable components in nesC under TinyOS. We present results from experiments on an 80-node testbed that demonstrate the performance and viability of the abstractions. In the future, these abstractions can be extended to detect and cope with larger classes of failures in WSAS.
LHView: Location Aware Hybrid Partial View
The rise of the Cloud creates enormous business opportunities for companies to provide
global services, which requires applications supporting the operation of those services
to scale while minimizing maintenance costs, either due to unnecessary allocation of
resources or due to excessive human supervision and administration. Solutions designed
to support such systems have tackled fundamental challenges from individual component
failure to transient network partitions. A fundamental aspect that all scalable large
systems have to deal with is the membership of the system, i.e., tracking the active components
that compose the system. Most systems rely on membership management protocols
that operate at the application level, often exposing the interface of a logical overlay
network that should guarantee high scalability, efficiency, and robustness.
Although these protocols are capable of repairing the overlay in the face of large numbers
of individual component faults, when scaling to global settings (i.e., geo-distributed
scenarios), this robustness is a double-edged sword because it is extremely complex for
a node in a system to distinguish between a set of simultaneous node failures and a
(transient) network partition. Thus the occurrence of a network partition creates isolated
subsets of nodes incapable of reconnecting even after the recovery from the partition.
This work addresses these challenges by proposing a novel datacenter-aware membership
protocol to tolerate network partitions by applying existing overlay management techniques
and classification techniques that may allow the system to efficiently cope with
such events without compromising the remaining properties of the overlay network. Furthermore,
we strive to achieve these goals with a solution that requires minimal human
intervention.
Epidemic broadcast trees
There is an inherent trade-off between epidemic and deterministic tree-based broadcast primitives. Tree-based approaches have a small message complexity in steady-state but are very fragile in the presence of faults. Gossip, or epidemic, protocols have a higher message complexity but also offer much higher resilience. This paper proposes an integrated broadcast scheme that combines both approaches. We use a low-cost scheme to build and maintain broadcast trees embedded on a gossip-based overlay. The protocol sends the message payload preferably via tree branches but uses the remaining links of the gossip overlay for fast recovery and expedited tree healing. The experimental evaluation presented in the paper shows that our new strategy has a low overhead and is able to support a large number of faults while maintaining high reliability. This work was partially supported by project P-SON: Probabilistically Structured Overlay Networks (POSC/EIA/60941/2004).
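The idea of sending payloads over tree branches while keeping the remaining overlay links for recovery can be sketched as a small in-memory simulation. This is a rough illustration under stated assumptions, not the paper's protocol: the `Node` class, the `eager`/`lazy` sets, the method names, the synchronous message delivery, and the explicit `on_timeout` call are all hypothetical simplifications.

```python
class Node:
    payloads_sent = 0  # class-wide counter, used only by the demo below

    def __init__(self, name):
        self.name = name
        self.alive = True
        self.eager = set()   # tree branches: receive the full payload
        self.lazy = set()    # other overlay links: receive only an announcement
        self.delivered = {}  # msg_id -> payload
        self.missing = {}    # msg_id -> peers that announced it

    def link(self, other):
        # Every overlay link starts out as a candidate tree branch.
        self.eager.add(other)
        other.eager.add(self)

    def broadcast(self, msg_id, payload):
        self.delivered[msg_id] = payload
        self._forward(msg_id, payload, sender=None)

    def _forward(self, msg_id, payload, sender):
        for peer in sorted(self.eager, key=lambda p: p.name):
            if peer is not sender and peer.alive:
                Node.payloads_sent += 1
                peer.on_payload(msg_id, payload, self)
        for peer in sorted(self.lazy, key=lambda p: p.name):
            if peer is not sender and peer.alive:
                peer.on_announce(msg_id, self)

    def _demote(self, peer):
        self.eager.discard(peer)
        self.lazy.add(peer)

    def _promote(self, peer):
        self.lazy.discard(peer)
        self.eager.add(peer)

    def on_payload(self, msg_id, payload, sender):
        if msg_id in self.delivered:
            # Redundant payload: this link is not needed as a tree branch.
            self._demote(sender)
            sender._demote(self)  # stands in for a "prune" message
            return
        self.delivered[msg_id] = payload
        self.missing.pop(msg_id, None)
        self._promote(sender)
        self._forward(msg_id, payload, sender)

    def on_announce(self, msg_id, sender):
        if msg_id not in self.delivered:
            self.missing.setdefault(msg_id, set()).add(sender)

    def on_timeout(self, msg_id):
        # The payload never arrived over the tree: pull it over a lazy link.
        for peer in sorted(self.missing.get(msg_id, ()), key=lambda p: p.name):
            if peer.alive and msg_id in peer.delivered:
                peer._promote(self)  # stands in for a "graft" message
                Node.payloads_sent += 1
                self.on_payload(msg_id, peer.delivered[msg_id], peer)
                return


a, b, c = Node("a"), Node("b"), Node("c")
a.link(b); b.link(c); c.link(a)   # triangle overlay

a.broadcast("m1", "hello")        # the first broadcast prunes the redundant link
first = Node.payloads_sent
a.broadcast("m2", "world")        # now only tree branches carry payloads
second = Node.payloads_sent - first

b.alive = False                   # the tree branch a-b-c breaks
a.broadcast("m3", "again")        # c only receives the announcement from a
c.on_timeout("m3")                # c pulls the payload and re-adds a-c to the tree
```

In this toy run the first broadcast sends four payloads and the second only two, because the redundant link a-c has been demoted to announcements. After `b` fails, the announcement over that lazy link is what lets `c` recover the message and repair the tree. A real implementation would use timers and network messages instead of direct method calls.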
Toward Reliable and Efficient Message Passing Software for HPC Systems: Fault Tolerance and Vector Extension
As the scale of High-Performance Computing (HPC) systems continues to grow, researchers devote themselves to achieving the best performance when running long computing jobs on these systems. My research focuses on the reliability and efficiency of HPC software.
First, as systems become larger, the mean-time-to-failure (MTTF) of these HPC systems tends to decrease, and handling system failures becomes a prime challenge. My research aims to present a general design and implementation of an efficient runtime-level failure detection and propagation strategy targeting large-scale, dynamic systems that is able to detect both node and process failures. The design uses multiple overlapping topologies to optimize detection and propagation, minimizing the incurred overhead and guaranteeing the scalability of the entire framework. Results from different machines and benchmarks, compared to related work, show that my design and implementation outperforms non-HPC solutions significantly, and is competitive with specialized HPC solutions that can manage only MPI applications.
Second, I endeavor to employ instruction-level parallelism to achieve optimal performance. Novel processors support long vector extensions, which enable researchers to exploit the potential peak performance of target architectures. Intel introduced Advanced Vector Extensions (AVX512 and AVX2) instructions for the x86 Instruction Set Architecture (ISA). Arm introduced the Scalable Vector Extension (SVE) with a new set of A64 instructions. Both enable greater parallelism. My research utilizes long vector reduction instructions to improve the performance of MPI reduction operations. I also use the gather and scatter features to speed up the packing and unpacking operations in MPI. The evaluation of the resulting software stack under different scenarios demonstrates that the approach is not only efficient but also generalizable to many vector architectures.
HyParView: a membership protocol for reliable gossip-based broadcast
Gossip, or epidemic, protocols have emerged as a powerful strategy to implement highly scalable and resilient reliable broadcast primitives. For scalability reasons, each participant in a gossip protocol maintains a partial view of the system. The reliability of the gossip protocol depends upon some critical properties of these views, such as degree distribution and clustering coefficient. Several algorithms have been proposed to maintain partial views for gossip protocols. In this paper, we show that under a high number of faults, these algorithms take a long time to restore the desirable view properties. To address this problem, we present HyParView, a new membership protocol to support gossip-based broadcast that ensures high levels of reliability even in the presence of high rates of node failure. The HyParView protocol is based on a novel approach that relies on the use of two distinct partial views, which are maintained with different goals by different strategies. This work was partially supported by project P-SON: Probabilistically Structured Overlay Networks (POSC/EIA/60941/2004). Parts of this report have been published in the Proceedings of the 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks. Edinburgh, UK, June, 200
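The abstract says only that the two partial views are maintained with different goals by different strategies. The sketch below assumes one plausible reading: a small active view used for communication, and a larger passive view that acts as a backup list from which replacements are drawn when an active peer fails. The class name, the view sizes, the random replacement policy, and the method names are hypothetical. The sketch also omits any handshake with the promoted peer and any periodic refresh of the passive view.

```python
import random


class PartialViews:
    def __init__(self, active_size=3, passive_size=6, seed=0):
        self.active_size = active_size
        self.passive_size = passive_size
        self.active = set()    # peers currently used for broadcast
        self.passive = set()   # backup peers, not contacted in normal operation
        self.rng = random.Random(seed)

    def add_passive(self, peer):
        if peer in self.active or peer in self.passive:
            return
        if len(self.passive) >= self.passive_size:
            # Make room by evicting a random backup peer.
            self.passive.remove(self.rng.choice(sorted(self.passive)))
        self.passive.add(peer)

    def add_active(self, peer):
        if peer in self.active:
            return
        self.passive.discard(peer)
        if len(self.active) >= self.active_size:
            # Keep the active view bounded: demote a random peer to backup.
            dropped = self.rng.choice(sorted(self.active))
            self.active.remove(dropped)
            self.add_passive(dropped)
        self.active.add(peer)

    def on_failure(self, peer):
        # Forget the failed peer and refill the active view from the backups.
        self.active.discard(peer)
        self.passive.discard(peer)
        if self.passive and len(self.active) < self.active_size:
            self.add_active(self.rng.choice(sorted(self.passive)))


views = PartialViews()
for p in ["n1", "n2", "n3"]:
    views.add_active(p)
for p in ["n4", "n5", "n6", "n7"]:
    views.add_passive(p)

views.on_failure("n2")
print(sorted(views.active), sorted(views.passive))
```

After the failure, the active view is back at its target size without waiting for a periodic exchange. That fast repair is the property the abstract attributes to keeping two views.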