59 research outputs found

    Design and analysis of fault-tolerant multibus interconnection networks

    In this paper, a new class of fault-tolerant multibus interconnection networks is presented and analyzed. Efficiency and fault tolerance have been the driving forces in the design of these structures. The most common types of faults have been explicitly considered, and in particular the jabbering problem has been adequately resolved. The analysis covers the evaluation of capacity, throughput and average delay, and it includes faults of one or more channels. The system is shown to be very efficient and to be able to adequately support channel and station faults.

    Towards An Efficient Cloud Computing System: Data Management, Resource Allocation and Job Scheduling

    Cloud computing is an emerging technology in distributed computing, and it has proved to be an effective infrastructure for providing services to users. The cloud is developing day by day and faces many challenges. One of the challenges is to build a cost-effective data management system that can ensure high data availability while maintaining consistency. Another challenge in the cloud is efficient resource allocation, which ensures high resource utilization and high SLO availability. Scheduling for high throughput, where scheduling refers to a set of policies that control the order of the work to be performed by a computer system, is another challenge. In this dissertation, we study how to manage data and improve data availability while reducing cost (i.e., consistency maintenance cost and storage cost); how to efficiently manage the resources for processing jobs and increase resource utilization with high SLO availability; and how to design an efficient scheduling algorithm that provides high throughput and low overhead while satisfying the demands on the completion time of jobs. Replication is a common approach to enhancing data availability in cloud storage systems. Previously proposed replication schemes cannot effectively handle both correlated and non-correlated machine failures while increasing data availability with limited resources. The schemes for correlated machine failures must create a constant number of replicas for each data object, which neglects diverse data popularities and cannot utilize the resources to maximize the expected data availability. Also, the previous schemes neglect the consistency maintenance cost and the storage cost caused by replication. It is critical for cloud providers to maximize data availability, and hence minimize SLA (Service Level Agreement) violations, while minimizing the cost caused by replication in order to maximize revenue.
In this dissertation, we build a nonlinear programming model to maximize data availability under both types of failures and minimize the cost caused by replication. Based on the model's solution for the replication degree of each data object, we propose a low-cost multi-failure resilient replication scheme (MRR). MRR can effectively handle both correlated and non-correlated machine failures, considers data popularities to enhance data availability, and also tries to minimize consistency maintenance and storage cost. In current clouds, providers still need to reserve resources to allow users to scale on demand. The capacity offered by cloud offerings is in the form of pre-defined virtual machine (VM) configurations. This incurs resource wastage and results in low resource utilization when users actually consume far fewer resources than the VM capacity. Existing works either reallocate the unused resources with no Service Level Objectives (SLOs) for availability (availability here refers to the probability that an allocated resource remains operational and accessible during the validity of the contract [CarvalhoCirne14]) or consider SLOs to reallocate the unused resources for long-running service jobs. The latter approach increases the allocated resources whenever it detects that an SLO is violated in order to achieve the SLO in the long term, neglecting the frequent fluctuations of jobs' resource requirements in real-time applications, especially for short-term jobs that require fast responses and fast decision making for resource allocation. Thus, this approach cannot fully utilize the resources to process data because it cannot quickly adjust the resource allocation strategy to deal with the fluctuations of jobs' resource requirements.
What's more, the previous opportunistic-based resource allocation approach aims at providing long-term availability SLOs with good QoS for long-running jobs, which ensures that the jobs can be finished within weeks or months by providing slightly degraded resources with moderate availability guarantees, but it ignores deadline constraints in defining Quality of Service (QoS) for short-lived jobs requiring online responses in real-time applications; thus it cannot truly guarantee the QoS and long-term availability SLOs. To overcome the drawbacks of previous works, we adequately consider the fluctuations of unused resources caused by bursts of jobs' resource demands, and present a cooperative opportunistic resource provisioning (CORP) scheme to dynamically allocate resources to jobs. CORP leverages the complementarity of jobs' requirements on different resource types and utilizes job packing to reduce resource wastage and increase resource utilization. An increasing number of large-scale data analytics frameworks are moving towards larger degrees of parallelism, aiming at high throughput. Scheduling, which assigns tasks to workers, and preemption, which suspends low-priority tasks and runs high-priority tasks, are two important functions in such frameworks. There are many existing works on scheduling and preemption in the literature that aim to provide high throughput. However, previous works do not substantially consider dependency when increasing throughput in scheduling or preemption. Considering dependency is crucial to increasing the overall throughput. Besides, extensive task evictions for preemption increase context switches, which may decrease the throughput. To address the above problems, we propose an efficient scheduling system, Dependency-aware Scheduling and Preemption (DSP), to achieve high throughput in scheduling and preemption.
First, we build a mathematical model to minimize the makespan with the consideration of task dependency, and derive the target workers for tasks that minimize the makespan; second, we utilize task dependency information to determine tasks' priorities for preemption; finally, we present a probabilistic-based preemption to reduce the number of preemptions while satisfying the demands on the completion time of jobs. We conduct trace-driven simulations on a real cluster and real-world experiments on Amazon S3/EC2 to demonstrate the efficiency and effectiveness of our proposed system in comparison with other systems. The experimental results show the superior performance of our proposed system. In the future, we will further consider data update frequency to reduce consistency maintenance cost, and we will consider the effects of node joining and node leaving. We will also consider the energy consumption of machines and design an optimal replication scheme to improve data availability while saving power. For resource allocation, we will consider using the greedy approach for deep learning to reduce the computation overhead caused by the deep neural network. Also, we will additionally consider the heterogeneity of jobs (i.e., short jobs and long jobs), and use a hybrid resource allocation strategy to provide SLO availability customization for different job types while increasing resource utilization. For scheduling, we will aim to handle scheduling tasks with partial dependency and worker failures in scheduling, and to make DSP fully distributed to increase its scalability. Finally, we plan to use different workloads and real-world experiments to fully test the performance of our methods and make our preliminary system design more mature.

    Fifteen years of quantum LDPC coding and improved decoding strategies

    The near-capacity performance of classical low-density parity check (LDPC) codes and their efficient iterative decoding make quantum LDPC (QLDPC) codes a promising candidate for quantum error correction. In this paper, we present a comprehensive survey of QLDPC codes from the perspective of code design as well as in terms of their decoding algorithms. We also conceive a modified non-binary decoding algorithm for homogeneous Calderbank-Shor-Steane-type QLDPC codes, which is capable of alleviating the problems imposed by the unavoidable length-four cycles. Our modified decoder outperforms the state-of-the-art decoders in terms of their word error rate performance, despite imposing a reduced decoding complexity. Finally, we intricately amalgamate our modified decoder with the classic uniformly reweighted belief propagation for the sake of achieving an improved performance.

    A Tutorial on RAID Storage Systems

    RAID storage systems have been in use since the early 1990s. Recently, however, as the demand for huge amounts of on-line storage has increased, RAID has once again come into focus. This report reviews the history of RAID, as well as where and how RAID systems fit in the storage hierarchy of an Enterprise Computing System (EIS). We describe the known RAID configurations and the advantages and disadvantages of each. Since the focus of our research is on the performance of RAID systems, we devote a section to the various factors that affect RAID performance. Modelling RAID systems for performance analysis is the topic of the next section, where we report on the issues and briefly describe one simulator, RAIDframe, which has been developed. We conclude with a section that describes the current open research questions in the area.

    Performability: a retrospective and some pointers to the future

    As computing and communication systems become physically and logically more complex, their evaluation calls for continued innovation with regard to measure definition, model construction/solution, and tool development. In particular, the performance of such systems is often degradable, i.e., internal or external faults can reduce the quality of a delivered service even though that service, according to its specification, remains proper (failure-free). The need to accommodate this property, using model-based evaluation methods, was the raison d'etre for the concept of performability. To set the stage for additional progress in its development, we present a retrospective of associated theory, techniques, and applications resulting from work in this area over the past decade and a half. Based on what has been learned, some pointers are made to future directions which might further enhance the effectiveness of these methods and broaden their scope of applicability. Peer Reviewed. http://deepblue.lib.umich.edu/bitstream/2027.42/30223/1/0000615.pd

    Scalability of RAID systems

    RAID systems (Redundant Arrays of Inexpensive Disks) have dominated backend storage systems for more than two decades and have grown continuously in size and complexity. Currently they face unprecedented challenges from data intensive applications such as image processing, transaction processing and data warehousing. As the size of RAID systems increases, designers are faced with both performance and reliability challenges. These challenges include limited back-end network bandwidth, physical interconnect failures, correlated disk failures and long disk reconstruction time. This thesis studies the scalability of RAID systems in terms of both performance and reliability through simulation, using a discrete event driven simulator for RAID systems (SIMRAID) developed as part of this project. SIMRAID incorporates two benchmark workload generators, based on the SPC-1 and Iometer benchmark specifications. Each component of SIMRAID is highly parameterised, enabling it to explore a large design space. To improve the simulation speed, SIMRAID develops a set of abstraction techniques to extract the behaviour of the interconnection protocol without losing accuracy. Finally, to meet the technology trend toward heterogeneous storage architectures, SIMRAID develops a framework that allows easy modelling of different types of device and interconnection technique. Simulation experiments were first carried out on performance aspects of scalability. They were designed to answer two questions: (1) given a number of disks, which factors affect back-end network bandwidth requirements; (2) given an interconnection network, how many disks can be connected to the system. The results show that the bandwidth requirement per disk is primarily determined by workload features and stripe unit size (a smaller stripe unit size has better scalability than a larger one), with cache size and RAID algorithm having very little effect on this value. 
The maximum number of disks is limited, as would be expected, by the back-end network bandwidth. Studies of reliability have led to three proposals to improve the reliability and scalability of RAID systems. Firstly, a novel data layout called PCDSDF is proposed. PCDSDF combines the advantages of orthogonal data layouts and parity declustering data layouts, so that it can not only survive multiple disk failures caused by physical interconnect failures or correlated disk failures, but also has good degraded and rebuild performance. The generating process of PCDSDF is deterministic and time-efficient. The number of stripes per rotation (namely the number of stripes needed to achieve rebuild workload balance) is small. Analysis shows that the PCDSDF data layout can significantly improve the system reliability. Simulations performed on SIMRAID confirm the good performance of PCDSDF, which is comparable to other parity declustering data layouts, such as RELPR. Secondly, a system architecture and rebuilding mechanism have been designed, aimed at fast disk reconstruction. This architecture is based on parity declustering data layouts and a disk-oriented reconstruction algorithm. It uses stripe groups instead of stripes as the basic distribution unit so that it can make use of the sequential nature of the rebuilding workload. The design space of system factors such as parity declustering ratio, chunk size, private buffer size of surviving disks and free buffer size is explored to provide guidelines for storage system design. Thirdly, an efficient distributed hot spare allocation and assignment algorithm for general parity declustering data layouts has been developed. This algorithm avoids conflict problems in the process of assigning distributed spare space for the units on the failed disk. Simulation results show that it effectively solves the write bottleneck problem and, at the same time, there is only a small increase in the average response time to user requests.

    Techniques for the realization of ultra-reliable spaceborne computer Final report

    Bibliography and new techniques for use of error correction and redundancy to improve reliability of spaceborne computer

    Aeronautical Engineering: A continuing bibliography with indexes, supplement 113, September 1979

    This bibliography lists 436 reports, articles, and other documents introduced into the NASA scientific and technical information system in August 1979