
    CRAID: Online RAID upgrades using dynamic hot data reorganization

    Current algorithms used to upgrade RAID arrays typically require large amounts of data to be migrated, even those that move only the minimum amount of data required to keep a balanced data load. This paper presents CRAID, a self-optimizing RAID array that performs an online block reorganization of frequently used, long-term accessed data in order to reduce this migration even further. To achieve this objective, CRAID tracks frequently used, long-term data blocks and copies them to a dedicated partition spread across all the disks in the array. When new disks are added, CRAID only needs to extend this process to the new devices to redistribute this partition, thus greatly reducing the overhead of the upgrade process. In addition, the reorganized access patterns within this partition improve the array’s performance, amortizing the copy overhead and allowing CRAID to offer performance competitive with traditional RAIDs. We describe CRAID’s motivation and design and evaluate it by replaying seven real-world workloads, including a file server, a web server and a user share. Our experiments show that CRAID can successfully detect hot data variations and begin using new disks as soon as they are added to the array. Moreover, the use of a dedicated partition improves the sequentiality of relevant data accesses, which amortizes the cost of reorganizations. Finally, we show that a full-HDD CRAID array with a small distributed partition (<1.28% per disk) can compete in performance with an ideally restriped RAID-5 and a hybrid RAID-5 with a small SSD cache.
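
    As a rough illustration of the general idea of tracking hot blocks and striping them over a small dedicated partition, the Python sketch below uses a simple access-frequency counter. The class and parameter names (`HotDataTracker`, `partition_blocks`) are hypothetical and do not come from the paper, and the round-robin placement is only a stand-in for CRAID's actual redistribution logic.

    ```python
    from collections import Counter

    class HotDataTracker:
        """Toy frequency-based tracker: counts block accesses and keeps the
        top-k blocks in a dedicated partition striped across all disks."""

        def __init__(self, num_disks, partition_blocks):
            self.num_disks = num_disks
            self.partition_blocks = partition_blocks  # capacity of the hot partition
            self.access_counts = Counter()

        def record_access(self, block_id):
            self.access_counts[block_id] += 1

        def hot_blocks(self):
            # Blocks accessed most often are candidates for the hot partition.
            return [b for b, _ in self.access_counts.most_common(self.partition_blocks)]

        def placement(self):
            # Stripe the hot partition round-robin over the current disks.
            return {b: i % self.num_disks for i, b in enumerate(self.hot_blocks())}

        def add_disk(self):
            # Only the (small) hot partition is redistributed, not the whole array.
            self.num_disks += 1
            return self.placement()

    if __name__ == "__main__":
        tracker = HotDataTracker(num_disks=4, partition_blocks=8)
        for block in [1, 2, 1, 3, 1, 2, 4, 5, 1, 2, 3]:
            tracker.record_access(block)
        print("placement on 4 disks:", tracker.placement())
        print("placement after adding a disk:", tracker.add_disk())
    ```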

    Computing at massive scale: Scalability and dependability challenges

    Large-scale Cloud systems and big data analytics frameworks are now widely used for practical services and applications. However, with the increase of data volume, together with the heterogeneity of workloads and resources and the dynamic nature of massive user requests, the uncertainty and complexity of resource management and service provisioning increase dramatically, often resulting in poor resource utilization, vulnerable system dependability, and user-perceived performance degradation. In this paper we report our latest understanding of the current and future challenges in this area and discuss both existing and potential solutions to the problems, especially those concerning system efficiency, scalability and dependability. We first introduce a data-driven analysis methodology for characterizing resource and workload patterns and tracing performance bottlenecks in a massive-scale distributed computing environment. We then examine and analyze several fundamental challenges and the solutions we are developing to tackle them, including incremental but decentralized resource scheduling, incremental messaging communication, rapid system failover, and request handling parallelism. We integrate these solutions with our data analysis methodology in order to establish an engineering approach that facilitates the optimization, tuning and verification of massive-scale distributed systems. We aim to develop and offer innovative methods and mechanisms for future computing platforms that will provide strong support for new big data and IoE (Internet of Everything) applications.
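
    As a minimal, hypothetical illustration of the kind of data-driven trace analysis described above (not the authors' methodology), the sketch below flags machines whose tail CPU utilization suggests a bottleneck; the trace format and the 0.9 threshold are assumptions.

    ```python
    # Hypothetical trace records: (machine_id, cpu_utilization in [0, 1]).
    trace = [
        ("m1", 0.42), ("m1", 0.95), ("m1", 0.97),
        ("m2", 0.30), ("m2", 0.35), ("m2", 0.40),
        ("m3", 0.88), ("m3", 0.91), ("m3", 0.93),
    ]

    def p95(values):
        """Crude 95th percentile: value at the 95% position of the sorted sample."""
        ordered = sorted(values)
        return ordered[min(len(ordered) - 1, int(0.95 * len(ordered)))]

    by_machine = {}
    for machine, util in trace:
        by_machine.setdefault(machine, []).append(util)

    # Flag machines whose tail utilization exceeds the assumed 0.9 threshold.
    bottlenecks = {m: p95(v) for m, v in by_machine.items() if p95(v) > 0.9}
    print(bottlenecks)  # e.g. {'m1': 0.97, 'm3': 0.93}
    ```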

    MACHS: Mitigating the Achilles Heel of the Cloud through High Availability and Performance-aware Solutions

    Cloud computing is continuously growing as a business model for hosting information and communication technology applications. However, many concerns arise regarding the quality of service (QoS) offered by the cloud. One major challenge is the high availability (HA) of cloud-based applications. The key to achieving availability requirements is to develop an approach that is immune to cloud failures while minimizing service level agreement (SLA) violations. To this end, this thesis addresses the HA of cloud-based applications from different perspectives. First, the thesis proposes a component HA-aware scheduler (CHASE) to manage the deployments of carrier-grade cloud applications while maximizing their HA and satisfying the QoS requirements. Second, a Stochastic Petri Net (SPN) model is proposed to capture the stochastic characteristics of cloud services and quantify the expected availability offered by an application deployment. The SPN model is then associated with an extensible policy-driven cloud scoring system that integrates other cloud challenges (i.e. green and cost concerns) with HA objectives. The proposed HA-aware solutions are extended with a live virtual machine migration model that provides a trade-off between migration time and downtime while maintaining the HA objective. Furthermore, the thesis proposes a generic input template for cloud simulators, GITS, to facilitate the creation of cloud scenarios while ensuring reusability, simplicity, and portability. Finally, an availability-aware CloudSim extension, ACE, is proposed. ACE extends the CloudSim simulator with failure injection, computational paths, repair, failover, load balancing, and other availability-based modules.
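
    As a simplified, hypothetical stand-in for the thesis's SPN-based availability quantification, the following sketch computes steady-state availability from MTTF/MTTR and the effect of adding redundant replicas; the function names and numbers are illustrative only.

    ```python
    def component_availability(mttf_hours, mttr_hours):
        """Steady-state availability of a single component: A = MTTF / (MTTF + MTTR)."""
        return mttf_hours / (mttf_hours + mttr_hours)

    def redundant_availability(a, replicas):
        """Availability of `replicas` independent instances where one suffices:
        1 minus the probability that all replicas are down simultaneously."""
        return 1.0 - (1.0 - a) ** replicas

    if __name__ == "__main__":
        a = component_availability(mttf_hours=1000.0, mttr_hours=2.0)  # ~0.998
        for n in (1, 2, 3):
            print(f"{n} replica(s): {redundant_availability(a, n):.6f}")
    ```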

    Analysis of scaling policies for NFV providing 5G/6G reliability levels with fallible servers

    The softwarization of mobile networks enables an efficient use of resources by dynamically scaling and re-assigning them following variations in demand. Given that the activation of additional servers is not immediate, scaling up resources should anticipate traffic demands to prevent service disruption. At the same time, the activation of more servers than strictly necessary results in a waste of resources and should thus be avoided. Given the stringent reliability requirements of 5G applications (up to 6 nines) and the fallible nature of servers, finding the right trade-off between efficiency and service disruption is particularly critical. In this paper, we analyze a generic auto-scaling mechanism for communication services, used to (de)activate servers in a cluster based on occupation thresholds. We model the impact of the activation delay and the finite lifetime of the servers on performance, in terms of power consumption and failure probability. Based on this model, we derive an algorithm to optimally configure the thresholds. Simulation results confirm the accuracy of the model under both synthetic and realistic traffic patterns, as well as the effectiveness of the configuration algorithm. We also provide some insights into the best strategy to support an energy-efficient, highly reliable service: deploying a few powerful and reliable machines versus deploying many less powerful and less reliable machines. The work of Jorge Ortin was funded in part by the Spanish Ministry of Science under Grant RTI2018-099063-B-I00, in part by the Gobierno de Aragon through Research Group Grant T31_20R, in part by the European Social Fund (ESF), and in part by the Centro Universitario de la Defensa under Grant CUD-2021_11. The work of Pablo Serrano was partly funded by the European Commission (EC) through the H2020 project Hexa-X (Grant Agreement no. 101015956), and in part by the Spanish State Research Agency (TRUE5G project, PID2019-108713RB-C52/AEI/10.13039/501100011033). The work of Jaime Garcia-Reinoso was partially supported by the EC in the framework of the H2020-EU.2.1.1. 5G EVE project (Grant Agreement no. 815074). The work of Albert Banchs was partially supported by the EC in the framework of the H2020-EU.2.1.1. 5G-TOURS project (Grant Agreement no. 856950) and by the Spanish State Research Agency (TRUE5G project, PID2019-108713RB-C52/AEI/10.13039/501100011033).
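
    The sketch below is a toy rendition of a threshold-based auto-scaler with an activation delay, in the spirit of the mechanism analyzed in the paper; all parameter values (thresholds, capacity, delay) are made up for illustration and are not the paper's optimal configuration.

    ```python
    import random

    # Illustrative parameters (not the paper's values).
    SERVER_CAPACITY = 100      # requests a single server handles per slot
    UP_THRESHOLD = 0.8         # scale up above this occupation
    DOWN_THRESHOLD = 0.4       # scale down below this occupation
    ACTIVATION_DELAY = 3       # slots before a booted server becomes active

    active, pending = 2, []    # `pending` holds remaining boot delays

    for t in range(50):
        demand = random.randint(50, 400)             # synthetic per-slot load
        pending = [d - 1 for d in pending]
        active += sum(1 for d in pending if d == 0)  # servers that finished booting
        pending = [d for d in pending if d > 0]

        occupation = demand / (active * SERVER_CAPACITY)
        if occupation > UP_THRESHOLD:
            pending.append(ACTIVATION_DELAY)         # anticipate demand: boot a server
        elif occupation < DOWN_THRESHOLD and active > 1:
            active -= 1                              # release an idle server

        dropped = max(0, demand - active * SERVER_CAPACITY)
        print(f"t={t:02d} demand={demand:3d} active={active} "
              f"booting={len(pending)} dropped={dropped}")
    ```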

    High availability using virtualization

    High availability has always been one of the main problems for a data center. Until now, high availability has been achieved through host-by-host redundancy, a highly expensive approach in terms of hardware and human costs. Virtualization offers a new approach to the problem. Using virtualization, it is possible to achieve a redundancy system for all the services running in a data center. This approach allows the running virtual machines to be shared across the physical servers that are up and running, by exploiting the features of the virtualization layer: starting, stopping and moving virtual machines between physical hosts. The system (3RC) is based on a finite state machine with hysteresis, providing the possibility to restart each virtual machine on any physical host, or to reinstall it from scratch. A complete infrastructure has been developed to install the operating system and middleware in a few minutes. To virtualize the main servers of a data center, a new procedure has been developed to migrate physical hosts to virtual ones. The whole SNS-PISA Grid data center is currently running in a virtual environment under the high availability system. As an extension of the 3RC architecture, several storage solutions, from NAS to SAN, have been tested to store and centralize all the virtual disks, to ensure data safety and access from everywhere. By exploiting virtualization and the ability to automatically reinstall a host, we provide a sort of host on-demand, where action on a virtual machine is taken only when a disaster occurs. Comment: PhD Thesis in Information Technology Engineering: Electronics, Computer Science, Telecommunications, pp. 94, University of Pisa [Italy]
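
    As an illustrative sketch (not the actual 3RC code), the following Python snippet models a finite state machine with hysteresis: a VM is restarted on another host only after several consecutive missed probes, and reinstalled from scratch if restarts keep failing. The thresholds and names are hypothetical.

    ```python
    from enum import Enum

    class State(Enum):
        RUNNING = 1
        SUSPECT = 2
        FAILED = 3

    class VMMonitor:
        """Toy state machine with hysteresis: declare a VM failed (and restart it
        elsewhere) only after several consecutive missed probes; reinstall it
        from scratch if restarts keep failing."""

        def __init__(self, fail_after=3, reinstall_after=2):
            self.state = State.RUNNING
            self.missed = 0
            self.restarts = 0
            self.fail_after = fail_after
            self.reinstall_after = reinstall_after

        def probe(self, alive: bool) -> str:
            if alive:
                self.state, self.missed, self.restarts = State.RUNNING, 0, 0
                return "ok"
            self.missed += 1
            if self.missed < self.fail_after:
                self.state = State.SUSPECT      # hysteresis: do not react yet
                return "suspect"
            self.state = State.FAILED
            self.missed = 0
            self.restarts += 1
            if self.restarts >= self.reinstall_after:
                self.restarts = 0
                return "reinstall-from-scratch"
            return "restart-on-another-host"

    if __name__ == "__main__":
        mon = VMMonitor()
        for alive in [True, False, False, False, False, False, False, True]:
            print(mon.probe(alive))
    ```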

    Efficiency and Reliability Analysis of AC and 380V DC Data Centers

    The rapid growth of the Internet has resulted in a colossal increase in the number of data centers. A data center consumes a tremendous amount of electricity, resulting in high operating costs. Even a slight improvement in the power distribution system of a data center could save millions of dollars in electricity bills. Benchmarks for both AC and 380V DC data centers are developed, and efficiency analyses are performed for an entire year. The efficiency of the power distribution system can be increased if the number of power conversion stages is reduced and more efficient converters are used. The use of wide band gap (WBG) converters further improves overall system efficiency because of their high efficiency. The results show that 380V DC data centers are more efficient than AC data centers both with and without PV integration. Using a 380V DC distribution system not only improves the efficiency of the system but also saves millions of dollars by decreasing system downtime. Maintaining high availability at all times is critical for data centers. A distribution system with a higher number of series components is more likely to fail, resulting in increased downtime. This study compares the reliability of the AC architecture against the 380V DC architecture. Reliability assessment was performed for both AC and DC systems complying with the Tier IV standard, for different levels of redundancy (e.g., N, N+1, N+2) in the UPS system. The Monte Carlo simulation method was used to perform the reliability calculations. The simulation results show that the 380V DC distribution system has a higher level of reliability than the AC distribution system in data centers, but only up to a certain level of redundancy in the UPS system. The reliability of the AC system approaches that of the DC system when a very high level of redundancy in the UPS system is considered, but this increases the overall cost of the data center.
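
    As a hedged illustration of the Monte Carlo approach mentioned above (not the study's actual model or failure data), the sketch below estimates the unavailability of a UPS group that needs N modules when N, N+1 and N+2 modules are deployed, assuming independent and identical module availability.

    ```python
    import random

    def monte_carlo_unavailability(needed, redundancy, module_availability, trials=100_000):
        """Estimate the probability that fewer than `needed` UPS modules are up
        when `needed + redundancy` independent modules are deployed."""
        total = needed + redundancy
        failures = 0
        for _ in range(trials):
            up = sum(random.random() < module_availability for _ in range(total))
            if up < needed:
                failures += 1
        return failures / trials

    if __name__ == "__main__":
        random.seed(42)
        for k in (0, 1, 2):  # N, N+1, N+2 redundancy levels
            u = monte_carlo_unavailability(needed=2, redundancy=k, module_availability=0.99)
            print(f"N+{k}: estimated unavailability ~ {u:.5f}")
    ```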