
    Resilience in Large Scale Distributed Systems

    Distributed systems are composed of multiple subsystems that interact in two distinct ways: (1) physical interactions and (2) cyber interactions, i.e., the sensors, actuators and computers controlling these subsystems, and the network over which they communicate. A broad class of cyber-physical systems (CPS) is described by such interactions, including the smart grid, platoons of autonomous vehicles and the sensorimotor system. This paper surveys recent progress in developing a coherent mathematical framework that describes the rich CPS “design space” of fundamental limits and tradeoffs between efficiency, robustness, adaptation, verification and scalability. Whereas most research treats at most one of these issues, we attempt a holistic approach in examining these metrics. In particular, we argue that a control architecture that emphasizes scalability leads to improvements in robustness, adaptation, and verification, while having only minor effects on efficiency – i.e., through the choice of a new architecture, we believe we can bring a system closer to the true fundamental hard limits of this complex design space.

    Towards Stabilization of Distributed Systems under Denial-of-Service

    In this paper, we consider networked distributed systems in the presence of Denial-of-Service (DoS) attacks, namely attacks that prevent transmissions over the communication network. First, we consider a simple and typical scenario where the communication sequence is purely Round-robin, and we explicitly calculate a bound on attack frequency and duration under which the interconnected large-scale system remains asymptotically stable. Second, trading off system resilience against communication load, we design a hybrid transmission strategy consisting of Zeno-free distributed event-triggered control and Round-robin. We show that, with lower communication loads, the hybrid communication strategy gives the systems the same resilience as pure Round-robin.
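The Round-robin scenario can be sketched with a toy slot-based simulation (our own illustrative model, not the paper's analysis; `longest_update_gap`, the DoS-interval representation, and the gap-based stability proxy are all assumptions):

```python
# Toy simulation of Round-robin transmissions under DoS. N subsystems
# transmit in turn, one per slot; a slot is lost if it falls inside a DoS
# interval. Stability is proxied here by the longest gap (in slots)
# between successful transmissions of the same subsystem.

def longest_update_gap(n_subsystems, horizon, dos_intervals):
    """Worst gap between successful transmissions of any one subsystem
    under a pure Round-robin schedule, over `horizon` slots."""
    last_success = {i: -1 for i in range(n_subsystems)}
    worst = 0
    for t in range(horizon):
        sender = t % n_subsystems                    # Round-robin order
        jammed = any(a <= t < b for a, b in dos_intervals)
        if not jammed:
            if last_success[sender] >= 0:
                worst = max(worst, t - last_success[sender])
            last_success[sender] = t
    return worst

# Example: 3 subsystems, DoS active during slots [4, 10).
print(longest_update_gap(3, 30, [(4, 10)]))   # → 9
```

Bounding DoS frequency and duration, as the paper does, amounts to bounding this worst-case gap so that the inter-sample intervals stay within what the interconnected system can tolerate.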

    Optimizing the Structure and Scale of Urban Water Infrastructure: Integrating Distributed Systems

    Large-scale, centralized water infrastructure has provided clean drinking water, wastewater treatment, stormwater management and flood protection for U.S. cities and towns for many decades, protecting public health, safety and environmental quality. To accommodate increasing demands driven by population growth and industrial needs, municipalities and utilities have typically expanded centralized water systems with longer distribution and collection networks. This approach achieves financial and institutional economies of scale and allows for centralized management. It comes with tradeoffs, however, including higher energy demands for long-distance transport; extensive maintenance needs; and disruption of the hydrologic cycle, including the large-scale transfer of freshwater resources to estuarine and saline environments.

    While smaller-scale distributed water infrastructure has been available for quite some time, it has yet to be widely adopted in urban areas of the United States. However, interest in rethinking how best to meet our water and sanitation needs has been building. Recent technological developments and concerns about sustainability and community resilience have prompted experts to view distributed systems as complementary to centralized infrastructure, and in some situations the preferred alternative.

    In March 2014, the Johnson Foundation at Wingspread partnered with the Water Environment Federation and the Patel College of Global Sustainability at the University of South Florida to convene a diverse group of experts to examine the potential for distributed water infrastructure systems to be integrated with or substituted for more traditional water infrastructure, with a focus on right-sizing the structure and scale of systems and services to optimize water, energy and sanitation management while achieving long-term sustainability and resilience.

    Design Strategy and Case Study of Distributed System Resilience in Chinese Context

    Whether facing a large-scale complex challenge or a radical change, society calls for a more resilient and sustainable socio-technical system. Distributed systems are a new trend in the sustainable transition of socio-technical systems. Research on the design strategies of distributed systems can help us better understand their nature and the role of designers, and help designers deal more confidently with related future challenges. This paper conducts an in-depth discussion of the resilience of socio-technical systems and the relationship between distributed systems and resilience. We selected three representative cases and analyzed them alongside a series of response measures taken by China in Wuhan during COVID-19. Three types of distributed system design strategies suitable for the Chinese context are identified.

    Scalable and Reliable Sparse Data Computation on Emergent High Performance Computing Systems

    Heterogeneous systems with both CPUs and GPUs have become important system architectures in emergent High Performance Computing (HPC) systems. Heterogeneous systems must address both performance-scalability and power-scalability in the presence of failures. Aggressive power reduction pushes hardware to its operating limit and increases the failure rate. Resilience allows programs to progress when subjected to faults and is an integral component of large-scale systems, but it incurs significant time and energy overhead. Future exascale systems are expected to have higher power consumption along with higher fault rates. Sparse data computation is a fundamental kernel in many scientific applications, and its computational characteristics make it well suited to studies of scalability and resilience on heterogeneous systems. To deliver the promised performance within the given power budget, heterogeneous computing demands a deep understanding of the interplay between scalability and resilience. Managing the two together is challenging in heterogeneous systems, due to the heterogeneous compute capability, power consumption, and varying failure rates between CPUs and GPUs. Scalability and resilience have traditionally been studied in isolation, and optimizing one typically impacts the other detrimentally. While prior works have proved successful in optimizing scalability and resilience on CPU-based homogeneous systems, simply extending current approaches to heterogeneous systems results in suboptimal performance-scalability and/or power-scalability. To address these research challenges, we propose novel resilience and energy-efficiency technologies to optimize scalability and resilience for sparse data computation on heterogeneous systems with CPUs and GPUs.
    First, we present generalized analytical and experimental methods to analyze and quantify the time and energy costs of various recovery schemes, and we develop and prototype performance-optimization and power-management strategies to improve scalability for sparse linear solvers. Our results quantitatively reveal that each resilience scheme has its own advantages depending on the fault rate, system size, and power budget, and that forward recovery can further benefit from our performance and power optimizations for large-scale computing. Second, we design a novel resilience technique that relaxes the requirement of synchronization and identicalness among processes, allowing them to run on heterogeneous resources with reduced power. Our results show a significant reduction in energy for unmodified programs in various fault situations compared to exact-replication techniques. Third, we propose a novel distributed sparse tensor decomposition that utilizes an asynchronous RDMA-based approach with OpenSHMEM to improve scalability on large-scale systems, and we show that our method works well in heterogeneous systems. Our results show that our irregularity-aware workload partition and balanced-asynchronous algorithms are scalable and outperform state-of-the-art distributed implementations. We demonstrate that understanding the different bottlenecks for various types of tensors plays a critical role in improving scalability.
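The trade-off between checkpoint overhead and re-work after a fault can be illustrated with Young's classic first-order checkpoint model (a standard textbook approximation used here for illustration; the function names and the example numbers are ours, not the dissertation's):

```python
import math

def young_interval(checkpoint_cost, mtbf):
    """Young's first-order optimal checkpoint interval (seconds):
    sqrt(2 * C * MTBF), where C is the cost of taking one checkpoint."""
    return math.sqrt(2.0 * checkpoint_cost * mtbf)

def expected_overhead(interval, checkpoint_cost, mtbf):
    """Approximate fraction of time lost to checkpointing plus the
    expected re-work after a failure (half an interval on average)."""
    return checkpoint_cost / interval + interval / (2.0 * mtbf)

# Example: 60 s checkpoints, one failure per day on average.
tau = young_interval(60.0, 24 * 3600.0)
print(round(tau), round(expected_overhead(tau, 60.0, 24 * 3600.0), 4))
# → 3220 0.0373
```

Higher fault rates shrink the optimal interval and raise the overhead floor, which is one quantitative reason different recovery schemes win in different fault-rate and power-budget regimes.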

    Repairable Block Failure Resilient Codes

    In large-scale distributed storage systems (DSS) deployed in cloud computing, correlated failures resulting in the simultaneous failure (or unavailability) of blocks of nodes are common. In such scenarios, the stored data, or the content of a failed node, can only be reconstructed from the available live nodes belonging to available blocks. To analyze the resilience of the system against such block failures, this work introduces the framework of Block Failure Resilient (BFR) codes, wherein the data (e.g., a file in a DSS) can be decoded by reading the same number of codeword symbols (nodes) from each available block of the underlying codeword. Further, repairable BFR codes are introduced, wherein any codeword symbol in a failed block can be repaired by contacting the remaining blocks in the system. Motivated by regenerating codes, file-size bounds for repairable BFR codes are derived, the trade-off between per-node storage and repair bandwidth is analyzed, and the BFR-MSR and BFR-MBR points are derived. Explicit codes achieving these two operating points for a wide set of parameters are constructed by utilizing combinatorial designs, wherein the codewords of the underlying outer codes are distributed to BFR codeword symbols according to projective planes.
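The decode-from-available-blocks idea can be sketched with a toy XOR-parity layout (our own minimal illustration; real BFR constructions use regenerating codes and projective-plane designs, and the `encode`/`decode` helpers here are hypothetical names):

```python
# Toy block-failure-resilient layout: four data symbols spread over
# 3 blocks with XOR parity. The file can be decoded by reading the same
# number of symbols (2) from each of any 2 surviving blocks, so any
# single block failure is tolerated.

def encode(d):                           # d = [d0, d1, d2, d3], integers
    return [[d[0], d[1]],                # block 0: systematic
            [d[2], d[3]],                # block 1: systematic
            [d[0] ^ d[2], d[1] ^ d[3]]]  # block 2: XOR parity

def decode(blocks, failed):
    """Recover the file when block `failed` is unavailable."""
    live = [b for i, b in enumerate(blocks) if i != failed]
    if failed == 2:                      # parity lost: data is intact
        return live[0] + live[1]
    sys_blk, parity = live               # one systematic block + parity
    missing = [sys_blk[j] ^ parity[j] for j in range(2)]
    return missing + sys_blk if failed == 0 else sys_blk + missing

data = [3, 1, 4, 1]
blocks = encode(data)
assert all(decode(blocks, f) == data for f in range(3))
```

Repairable BFR codes go further: rather than decoding the whole file, a single lost symbol is regenerated by downloading a small amount from each remaining block, which is where the storage–repair-bandwidth trade-off arises.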