Computing in the RAIN: a reliable array of independent nodes
The RAIN project is a research collaboration between Caltech and NASA-JPL on distributed computing and data-storage systems for future spaceborne missions. The goal of the project is to identify and develop key building blocks for reliable distributed systems built with inexpensive off-the-shelf components. The RAIN platform consists of a heterogeneous cluster of computing and/or storage nodes connected via multiple interfaces to networks configured in fault-tolerant topologies. The RAIN software components run in conjunction with operating system services and standard network protocols. Through software-implemented fault tolerance, the system tolerates multiple node, link, and switch failures, with no single point of failure. The RAIN technology has been transferred to Rainfinity, a start-up company focused on creating clustered solutions for improving the performance and availability of Internet data centers. In this paper, we describe the following contributions: 1) fault-tolerant interconnect topologies and communication protocols providing consistent error reporting of link failures, 2) fault management techniques based on group membership, and 3) data storage schemes based on computationally efficient error-control codes. We present several proof-of-concept applications: a highly available video server, a highly available Web server, and a distributed checkpointing system. We also describe a commercial product, Rainwall, built with the RAIN technology.
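The third contribution relies on computationally efficient error-control codes. As a minimal single-parity sketch of that general idea (not the RAIN codes themselves, which tolerate multiple failures), data can be striped across k nodes with one XOR parity block, allowing the array to rebuild any one lost block:

```python
# Minimal XOR-parity storage sketch, illustrating the idea of computationally
# efficient error-control codes; the actual RAIN codes are more general.

def encode(data: bytes, k: int) -> list[bytes]:
    """Split data into k equal-size blocks and append one XOR parity block."""
    size = -(-len(data) // k)  # ceiling division
    blocks = [data[i * size:(i + 1) * size].ljust(size, b"\0") for i in range(k)]
    parity = bytearray(size)
    for block in blocks:
        for i, b in enumerate(block):
            parity[i] ^= b
    return blocks + [bytes(parity)]

def reconstruct(blocks):
    """Rebuild a single missing block (marked None) by XOR-ing the survivors."""
    missing = [i for i, blk in enumerate(blocks) if blk is None]
    assert len(missing) <= 1, "single parity tolerates only one lost block"
    if missing:
        size = len(next(blk for blk in blocks if blk is not None))
        rebuilt = bytearray(size)
        for blk in blocks:
            if blk is not None:
                for i, b in enumerate(blk):
                    rebuilt[i] ^= b
        blocks[missing[0]] = bytes(rebuilt)
    return blocks

if __name__ == "__main__":
    stored = encode(b"highly available video frames", k=4)
    stored[2] = None                      # simulate one failed storage node
    recovered = reconstruct(stored)
    print(b"".join(recovered[:4]).rstrip(b"\0"))
```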
A distributed file service based on optimistic concurrency control
The design of a layered file service for the Amoeba Distributed System is discussed, on top of which various applications can easily be implemented. The bottom layer is formed by the Amoeba Block Services, responsible for implementing stable storage and replicated, highly available disk blocks. The next layer is formed by the Amoeba File Service, which provides version management and concurrency control for tree-structured files. On top of this layer, the applications, ranging from databases to source code control systems, determine the structure of the file trees and provide an interface to the users.
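A minimal sketch of version-based optimistic concurrency control of the kind described here (illustrative only; the names below are assumptions, not the Amoeba File Service API): a client reads a file version, prepares changes without holding locks, and a commit is accepted only if the version it read is still current.

```python
# Sketch of optimistic concurrency control over versioned files; the interface
# is a stand-in, not the Amoeba File Service API.

class VersionConflict(Exception):
    pass

class VersionedFileStore:
    def __init__(self) -> None:
        self._files: dict[str, tuple[int, bytes]] = {}   # path -> (version, data)

    def read(self, path: str) -> tuple[int, bytes]:
        return self._files.get(path, (0, b""))

    def commit(self, path: str, expected_version: int, data: bytes) -> int:
        """Install a new version only if no one else committed since we read."""
        current, _ = self._files.get(path, (0, b""))
        if current != expected_version:
            raise VersionConflict(f"{path}: read v{expected_version}, now v{current}")
        self._files[path] = (current + 1, data)
        return current + 1

if __name__ == "__main__":
    store = VersionedFileStore()
    version, _ = store.read("/src/main.c")            # optimistic read, no lock held
    store.commit("/src/main.c", version, b"int main(){}\n")
    try:
        store.commit("/src/main.c", version, b"stale\n")  # lost the race: version moved on
    except VersionConflict as err:
        print("retry:", err)
```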
Enabling Adaptive Grid Scheduling and Resource Management
Wider adoption of the Grid concept has led to an increasing amount of federated computational, storage, and visualisation resources being available to scientists and researchers. The distributed and heterogeneous nature of these resources renders most legacy cluster monitoring and management approaches inappropriate and poses new challenges in workflow scheduling on such systems. Effective resource utilisation monitoring and highly granular yet adaptive measurements are prerequisites for a more efficient Grid scheduler. We present a suite of measurement applications able to monitor per-process resource utilisation, and a customisable tool for emulating observed utilisation models. We also outline our future work on a predictive and probabilistic Grid scheduler. The research is undertaken as part of the UK e-Science EPSRC-sponsored project SO-GRM (Self-Organising Grid Resource Management) in cooperation with BT.
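As an illustration of the kind of per-process utilisation sampling such a suite performs (a sketch assuming a Linux /proc filesystem; this is not the SO-GRM implementation):

```python
# Sketch of per-process CPU and memory sampling from Linux /proc; a stand-in
# for the measurement suite described above, not the SO-GRM tools.
import os
import time

def sample(pid: int) -> dict[str, float]:
    """Return cumulative CPU seconds and resident memory (MB) for one process."""
    with open(f"/proc/{pid}/stat") as f:
        fields = f.read().split()          # assumes the process name has no spaces
    ticks = os.sysconf("SC_CLK_TCK")
    utime, stime = int(fields[13]), int(fields[14])     # user/system time in jiffies
    rss_pages = int(fields[23])                         # resident set size in pages
    return {
        "cpu_seconds": (utime + stime) / ticks,
        "rss_mb": rss_pages * os.sysconf("SC_PAGE_SIZE") / 1e6,
    }

if __name__ == "__main__":
    pid = os.getpid()
    before = sample(pid)
    sum(i * i for i in range(2_000_000))                # burn some CPU for the demo
    after = sample(pid)
    print("cpu used:", round(after["cpu_seconds"] - before["cpu_seconds"], 3), "s")
    print("resident:", round(after["rss_mb"], 1), "MB")
```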
Design and Implementation of Highly Available Information Systems
Advances in information technology have created a broad spectrum of opportunities to increase uptime and reduce outages of critical information systems. Many applications are now designed with features that reduce the complexity of implementing redundancy. Advances in network technology have increased the availability and cost effectiveness of implementing geographic redundancy. Replication features are now commonly available in storage systems. Virtualization technologies have created opportunities to move servers electronically across the network, rather than physically. Cloud Computing services have become mature and abundant, creating opportunities to outsource strategic functions to a distributed ‘cloud’. With proper analysis and design, Highly Available Information Systems are within reach for most organizations.
Microgrid availability during natural disasters
A common issue with the power grid during natural disasters is low availability. Many critical applications that are required during and after natural disasters for rescue and logistical operations need highly available power supplies. Microgrids with distributed generation resources, operating alongside the grid, provide promising solutions for improving the availability of power supply during natural disasters. However, distributed generators (DGs) such as diesel gensets depend on lifelines such as transportation networks, whose behavior during disasters affects the genset fuel delivery systems and, as a result, the availability. Renewable sources depend on natural phenomena that have both deterministic and stochastic aspects to their behavior, which usually results in high variability in their output. Therefore, DGs require energy storage in order to make them dispatchable sources. A microgrid's availability depends on the availability characteristics of its distributed generators and energy storage and their dependent infrastructure, the distribution architecture, and the power electronic interfaces. This dissertation presents models to evaluate the availability of power supply from the various distributed energy resources of a microgrid during natural disasters. The stochastic behavior of the distributed generators, storage, and interfaces is modeled using Markov processes, and the effect of the distribution network on availability is also considered. The presented models, supported by empirical data, can hence be used for microgrid planning.
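As a worked example of the simplest Markov model in this family (a two-state up/down sketch with assumed failure and repair rates, not the dissertation's models), a source with failure rate λ and repair rate μ has steady-state availability A = μ / (λ + μ) = MTTF / (MTTF + MTTR), and sources can then be composed in series or in parallel:

```python
# Two-state (up/down) Markov availability sketch with illustrative rates; the
# dissertation's models also capture fuel logistics, storage, and interfaces.

def steady_state_availability(mttf_hours: float, mttr_hours: float) -> float:
    """A = MTTF / (MTTF + MTTR), equivalently mu / (lambda + mu)."""
    return mttf_hours / (mttf_hours + mttr_hours)

def series(*availabilities: float) -> float:
    """All components must be up (e.g. a source AND its power-electronic interface)."""
    a = 1.0
    for x in availabilities:
        a *= x
    return a

def parallel(*availabilities: float) -> float:
    """At least one redundant source must be up (e.g. genset OR battery storage)."""
    unavailable = 1.0
    for x in availabilities:
        unavailable *= (1.0 - x)
    return 1.0 - unavailable

if __name__ == "__main__":
    genset = steady_state_availability(mttf_hours=500, mttr_hours=48)    # slow repair: fuel delivery
    battery = steady_state_availability(mttf_hours=2000, mttr_hours=24)
    inverter = steady_state_availability(mttf_hours=10000, mttr_hours=12)
    print("supply availability:", round(series(parallel(genset, battery), inverter), 4))
```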
Building Distributed Systems with Non-Volatile Main Memories and RDMA Networks
High-performance, byte-addressable non-volatile main memories (NVMMs) allow application developers to combine storage and memory into a single layer. These high-performance storage systems would be especially useful in large-scale data center environments where data is distributed and replicated across multiple servers. Unfortunately, existing approaches to providing remote storage access rest on the assumption that storage is slow, so the cost of the software and protocols is acceptable. This assumption no longer holds for fast NVMMs. As a result, taking full advantage of NVMMs’ potential will require changes in system software and networking protocols. This thesis focuses on accessing remote NVMM efficiently using remote direct memory access (RDMA) networks. RDMA enables a client to directly access memory on a remote machine without involving the remote machine's CPU. This thesis first presents Mojim, a system that provides replicated, reliable, and highly available NVMM as an operating system service. Applications can access data in Mojim using normal load and store instructions while controlling when and how updates propagate to replicas using system calls. Our evaluation shows Mojim adds little overhead to the un-replicated system and provides 0.4x to 2.7x the throughput of the un-replicated system. The thesis then presents Orion, a distributed file system designed from the ground up for NVMM and RDMA networks. Traditional distributed file systems are designed for slower hard drives, and these slower media incentivize complex optimizations (e.g., queuing, striping, and batching) around disk accesses. Orion combines file system functions and network operations into a single layer; it provides low-latency metadata access and outperforms existing distributed file systems by a large margin. Finally, an NVMM application can map files backed by an NVMM file system into its address space and access them using CPU instructions. In this case, RDMA and NVMM file systems introduce duplication of effort on permissions, naming, and address translation. We introduce two changes to the existing RDMA protocol: the file memory region (FileMR) and range-based address translation. By eliminating redundant translations, FileMR minimizes the number of translations done at the NIC, reducing the load on the NIC’s translation cache and improving application performance by 1.8x to 2.0x.
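The load/store-plus-explicit-propagation pattern described for Mojim can be sketched as follows. This assumes a file on an NVMM-backed (e.g. DAX-mounted) file system at a hypothetical path, and uses plain mmap/msync as a stand-in for Mojim's replication-controlling system calls, whose names and semantics differ:

```python
# Sketch of memory-mapped NVMM access with an explicit "propagate now" point;
# PATH is a hypothetical NVMM-backed file, and flush() stands in for the
# replication system call Mojim actually provides.
import mmap
import os

PATH = "/mnt/pmem/log.dat"          # hypothetical file on an NVMM file system

def append_record(payload: bytes) -> None:
    fd = os.open(PATH, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        os.ftruncate(fd, 4096)
        with mmap.mmap(fd, 4096) as buf:
            buf[0:len(payload)] = payload   # updates via plain store instructions
            buf.flush()                     # explicit point where updates propagate
    finally:
        os.close(fd)

if __name__ == "__main__":
    append_record(b"checkpoint 42\n")
```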
TOWARDS DIGITAL TWINS FOR OPTIMIZING METRICS IN DISTRIBUTED STORAGE SYSTEMS - A REVIEW
With exponential data growth, there is a crucial need for highly available, scalable, reliable, and cost-effective Distributed Storage Systems (DSSs). To build such efficient and fault-tolerant systems, replication and erasure coding techniques are typically used in traditional DSSs. However, these systems are prone to failure and require different failure prevention and recovery algorithms. Failure recovery and data reconstruction techniques in DSSs take into consideration the optimization of different performance metrics during the recovery process. In this paper, DSS performance metrics are introduced. Several recent papers on adopting erasure coding in DSSs are surveyed, highlighting the performance metrics introduced in the context of these papers. Next, we present recent literature in which Digital Twins (DTs) are used to monitor DSSs and assist data center managers in intelligent decision-making. Finally, important open issues are identified to inspire future studies towards fully efficient DSSs.
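The replication-versus-erasure-coding trade-off behind these metrics can be made concrete with a back-of-the-envelope sketch (the parameters are illustrative assumptions, not figures from any surveyed system): an (n, k) code stores n/k bytes per byte of user data, tolerates n - k lost blocks, and classically repairs one lost block by reading k survivors.

```python
# Back-of-the-envelope DSS metrics for replication vs. an (n, k) erasure code;
# parameters are illustrative, not results from the surveyed papers.

def dss_metrics(n: int, k: int, block_mb: float) -> dict[str, float]:
    return {
        "storage_overhead": n / k,         # bytes stored per byte of user data
        "failures_tolerated": n - k,       # any n - k blocks may be lost
        "repair_read_mb": k * block_mb,    # classic repair reads k surviving blocks
    }

if __name__ == "__main__":
    print("3-way replication :", dss_metrics(n=3, k=1, block_mb=64))
    print("RS(9, 6) erasure  :", dss_metrics(n=9, k=6, block_mb=64))
```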
Coordination Protocols for Verifiable Consistency in Distributed Storage Systems
Achieving consistency in a highly available distributed storage system has been formally proven to be an impossible task when the system faces network partitions and faulty processes. The complexity is exacerbated when the system allows concurrent processes to send transactions to all the other servers and coordinate the consistent commitment of different transactions. In the event of a partition, each server may allow clients to request updates involving the current state of the data, which makes achieving replicated consistency challenging. To solve these inconsistency problems, several consensus protocols are used, but they have strict requirements in order to make progress and are not guaranteed to ever converge to a single value. Additionally, the coordination required to achieve consistency after a partition is extremely high, as each node must compare transaction times and conflicting data with all other servers in the system. To address inconsistency in distributed systems, this thesis proposes a new coordination protocol that uses four ideas to let clients verify the consistency of data: (1) a universal timestamp signatory to certify the global order of events, (2) a relative consistency indicator to determine relative consistency during partitions, (3) an operation-based, recency-weighted conflict resolution algorithm to simplify coordination for achieving global consistency, and (4) a rejection-oriented distributed transaction commit protocol to eliminate the guarantees required by atomic commit protocols and verify local consistency. The thesis evaluates and analyzes various issues related to coordination and concurrency under network partitions. The experimental results demonstrate that the proposed methods provide verifiable consistency to the servers of distributed storage systems.
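A heavily simplified sketch of idea (3), resolving conflicts after a partition heals by replaying updates in the order certified by a global timestamp authority (the thesis's operation-based, recency-weighted algorithm, the signatory, and the commit protocol are richer than this; the names below are assumptions):

```python
# Simplified timestamp-ordered conflict resolution after a partition heals.
# The "ts" field is assumed to be certified by a trusted timestamp signatory.
from dataclasses import dataclass

@dataclass(frozen=True)
class Update:
    key: str
    value: str
    ts: int          # certified global timestamp
    server: str      # deterministic tie-breaker when timestamps collide

def merge(*logs: list) -> dict[str, str]:
    """Replay every server's log in certified global order; later updates win."""
    ordered = sorted((u for log in logs for u in log), key=lambda u: (u.ts, u.server))
    state: dict[str, str] = {}
    for u in ordered:
        state[u.key] = u.value
    return state

if __name__ == "__main__":
    server_a = [Update("cart", "book", ts=10, server="A"),
                Update("cart", "book+pen", ts=17, server="A")]
    server_b = [Update("cart", "lamp", ts=12, server="B")]   # written during the partition
    print(merge(server_a, server_b))    # most recent certified write wins: 'book+pen'
```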