
    Resilience in Large Scale Distributed Systems

    Distributed systems are composed of multiple subsystems that interact in two distinct ways: (1) physical interactions and (2) cyber interactions, i.e., the sensors, actuators and computers controlling these subsystems, and the network over which they communicate. A broad class of cyber-physical systems (CPS) is described by such interactions, including the smart grid, platoons of autonomous vehicles and the sensorimotor system. This paper surveys recent progress in developing a coherent mathematical framework that describes the rich CPS “design space” of fundamental limits and tradeoffs between efficiency, robustness, adaptation, verification and scalability. Whereas most research treats at most one of these issues, we attempt a holistic approach in examining these metrics. In particular, we argue that a control architecture that emphasizes scalability leads to improvements in robustness, adaptation, and verification, while having only minor effects on efficiency – i.e., through the choice of a new architecture, we believe we can bring a system closer to the true fundamental hard limits of this complex design space.

    Towards Stabilization of Distributed Systems under Denial-of-Service

    In this paper, we consider networked distributed systems in the presence of Denial-of-Service (DoS) attacks, namely attacks that prevent transmissions over the communication network. First, we consider a simple and typical scenario where the communication sequence is purely Round-robin, and we explicitly calculate a bound on attack frequency and duration under which the interconnected large-scale system remains asymptotically stable. Second, trading off system resilience against communication load, we design a hybrid transmission strategy consisting of Zeno-free distributed event-triggered control and Round-robin. We show that, with lower communication loads, the hybrid communication strategy gives the systems the same resilience as pure Round-robin.
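The Round-robin scenario can be sketched with a toy slot-based simulation (our own illustrative model, not the paper's analysis; `longest_update_gap`, the DoS-interval representation, and the gap-based stability proxy are all assumptions):

```python
# Toy simulation of Round-robin transmissions under DoS. N subsystems
# transmit in turn, one per slot; a slot is lost if it falls inside a DoS
# interval. Stability is proxied here by the longest gap (in slots)
# between successful transmissions of the same subsystem.

def longest_update_gap(n_subsystems, horizon, dos_intervals):
    """Worst gap between successful transmissions of any one subsystem
    under a pure Round-robin schedule, over `horizon` slots."""
    last_success = {i: -1 for i in range(n_subsystems)}
    worst = 0
    for t in range(horizon):
        sender = t % n_subsystems                    # Round-robin order
        jammed = any(a <= t < b for a, b in dos_intervals)
        if not jammed:
            if last_success[sender] >= 0:
                worst = max(worst, t - last_success[sender])
            last_success[sender] = t
    return worst

# Example: 3 subsystems, DoS active during slots [4, 10).
print(longest_update_gap(3, 30, [(4, 10)]))   # → 9
```

Bounding DoS frequency and duration, as the paper does, amounts to bounding this worst-case gap so that the inter-sample intervals stay within what the interconnected system can tolerate.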

    Optimizing the Structure and Scale of Urban Water Infrastructure: Integrating Distributed Systems

    Large-scale, centralized water infrastructure has provided clean drinking water, wastewater treatment, stormwater management and flood protection for U.S. cities and towns for many decades, protecting public health, safety and environmental quality. To accommodate increasing demands driven by population growth and industrial needs, municipalities and utilities have typically expanded centralized water systems with longer distribution and collection networks. This approach achieves financial and institutional economies of scale and allows for centralized management. It comes with tradeoffs, however, including higher energy demands for long-distance transport; extensive maintenance needs; and disruption of the hydrologic cycle, including the large-scale transfer of freshwater resources to estuarine and saline environments.

    While smaller-scale distributed water infrastructure has been available for quite some time, it has yet to be widely adopted in urban areas of the United States. However, interest in rethinking how best to meet our water and sanitation needs has been building. Recent technological developments and concerns about sustainability and community resilience have prompted experts to view distributed systems as complementary to centralized infrastructure, and in some situations the preferred alternative.

    In March 2014, the Johnson Foundation at Wingspread partnered with the Water Environment Federation and the Patel College of Global Sustainability at the University of South Florida to convene a diverse group of experts to examine the potential for distributed water infrastructure systems to be integrated with or substituted for more traditional water infrastructure, with a focus on right-sizing the structure and scale of systems and services to optimize water, energy and sanitation management while achieving long-term sustainability and resilience.

    Design Strategy and Case Study of Distributed System Resilience in Chinese Context

    Whether facing a large-scale complex challenge or a radical change, society calls for a more resilient and sustainable socio-technical system. Distributed systems are a new trend in the sustainable transition of socio-technical systems. Research on the design strategies of distributed systems can help us better understand their nature and the role of designers, and help designers deal more confidently with related future challenges. This paper conducts an in-depth discussion of the resilience of socio-technical systems and the relationship between distributed systems and resilience. We selected three representative cases and analyzed them alongside a series of response measures taken by China in Wuhan during COVID-19. Three types of distributed system design strategies suitable for the Chinese context are identified.

    Scalable and Reliable Sparse Data Computation on Emergent High Performance Computing Systems

    Heterogeneous systems with both CPUs and GPUs have become important system architectures in emergent High Performance Computing (HPC) systems. Heterogeneous systems must address both performance-scalability and power-scalability in the presence of failures. Aggressive power reduction pushes hardware to its operating limit and increases the failure rate. Resilience allows programs to progress when subjected to faults and is an integral component of large-scale systems, but it incurs significant time and energy overhead. Future exascale systems are expected to have higher power consumption along with higher fault rates. Sparse data computation is a fundamental kernel in many scientific applications, and its computational characteristics make it well suited to studies of scalability and resilience on heterogeneous systems. To deliver the promised performance within the given power budget, heterogeneous computing demands a deep understanding of the interplay between scalability and resilience. Managing the two together is challenging in heterogeneous systems, due to the heterogeneous compute capability, power consumption, and varying failure rates between CPUs and GPUs. Scalability and resilience have traditionally been studied in isolation, and optimizing one typically impacts the other detrimentally. While prior works have proved successful in optimizing scalability and resilience on CPU-based homogeneous systems, simply extending current approaches to heterogeneous systems results in suboptimal performance-scalability and/or power-scalability. To address these research challenges, we propose novel resilience and energy-efficiency technologies to optimize scalability and resilience for sparse data computation on heterogeneous systems with CPUs and GPUs.
    First, we present generalized analytical and experimental methods to analyze and quantify the time and energy costs of various recovery schemes, and we develop and prototype performance-optimization and power-management strategies to improve scalability for sparse linear solvers. Our results quantitatively reveal that each resilience scheme has its own advantages depending on the fault rate, system size, and power budget, and that forward recovery can further benefit from our performance and power optimizations for large-scale computing. Second, we design a novel resilience technique that relaxes the requirement of synchronization and identicalness among processes, allowing them to run on heterogeneous resources with reduced power. Our results show a significant reduction in energy for unmodified programs in various fault situations compared to exact-replication techniques. Third, we propose a novel distributed sparse tensor decomposition that utilizes an asynchronous RDMA-based approach with OpenSHMEM to improve scalability on large-scale systems, and we show that our method works well in heterogeneous systems. Our results show that our irregularity-aware workload partition and balanced-asynchronous algorithms are scalable and outperform state-of-the-art distributed implementations. We demonstrate that understanding the different bottlenecks for various types of tensors plays a critical role in improving scalability.
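The trade-off between checkpoint overhead and re-work after a fault can be illustrated with Young's classic first-order checkpoint model (a standard textbook approximation used here for illustration; the function names and the example numbers are ours, not the dissertation's):

```python
import math

def young_interval(checkpoint_cost, mtbf):
    """Young's first-order optimal checkpoint interval (seconds):
    sqrt(2 * C * MTBF), where C is the cost of taking one checkpoint."""
    return math.sqrt(2.0 * checkpoint_cost * mtbf)

def expected_overhead(interval, checkpoint_cost, mtbf):
    """Approximate fraction of time lost to checkpointing plus the
    expected re-work after a failure (half an interval on average)."""
    return checkpoint_cost / interval + interval / (2.0 * mtbf)

# Example: 60 s checkpoints, one failure per day on average.
tau = young_interval(60.0, 24 * 3600.0)
print(round(tau), round(expected_overhead(tau, 60.0, 24 * 3600.0), 4))
# → 3220 0.0373
```

Higher fault rates shrink the optimal interval and raise the overhead floor, which is one quantitative reason different recovery schemes win in different fault-rate and power-budget regimes.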

    Repairable Block Failure Resilient Codes

    In large-scale distributed storage systems (DSS) deployed in cloud computing, correlated failures resulting in the simultaneous failure (or unavailability) of blocks of nodes are common. In such scenarios, the stored data, or the content of a failed node, can only be reconstructed from the available live nodes belonging to available blocks. To analyze the resilience of the system against such block failures, this work introduces the framework of Block Failure Resilient (BFR) codes, wherein the data (e.g., a file in a DSS) can be decoded by reading the same number of codeword symbols (nodes) from each available block of the underlying codeword. Further, repairable BFR codes are introduced, wherein any codeword symbol in a failed block can be repaired by contacting the remaining blocks in the system. Motivated by regenerating codes, file-size bounds for repairable BFR codes are derived, the trade-off between per-node storage and repair bandwidth is analyzed, and the BFR-MSR and BFR-MBR points are derived. Explicit codes achieving these two operating points for a wide set of parameters are constructed by utilizing combinatorial designs, wherein the codewords of the underlying outer codes are distributed to BFR codeword symbols according to projective planes.
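The decode-from-available-blocks idea can be sketched with a toy XOR-parity layout (our own minimal illustration; real BFR constructions use regenerating codes and projective-plane designs, and the `encode`/`decode` helpers here are hypothetical names):

```python
# Toy block-failure-resilient layout: four data symbols spread over
# 3 blocks with XOR parity. The file can be decoded by reading the same
# number of symbols (2) from each of any 2 surviving blocks, so any
# single block failure is tolerated.

def encode(d):                           # d = [d0, d1, d2, d3], integers
    return [[d[0], d[1]],                # block 0: systematic
            [d[2], d[3]],                # block 1: systematic
            [d[0] ^ d[2], d[1] ^ d[3]]]  # block 2: XOR parity

def decode(blocks, failed):
    """Recover the file when block `failed` is unavailable."""
    live = [b for i, b in enumerate(blocks) if i != failed]
    if failed == 2:                      # parity lost: data is intact
        return live[0] + live[1]
    sys_blk, parity = live               # one systematic block + parity
    missing = [sys_blk[j] ^ parity[j] for j in range(2)]
    return missing + sys_blk if failed == 0 else sys_blk + missing

data = [3, 1, 4, 1]
blocks = encode(data)
assert all(decode(blocks, f) == data for f in range(3))
```

Repairable BFR codes go further: rather than decoding the whole file, a single lost symbol is regenerated by downloading a small amount from each remaining block, which is where the storage–repair-bandwidth trade-off arises.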