368 research outputs found

    Using Rollback Avoidance to Mitigate Failures in Next-Generation Extreme-Scale Systems

    Get PDF
    High-performance computing (HPC) systems enable scientists to numerically model complex phenomena in many important physical systems. The next major milestone in the development of HPC systems is the construction of the first supercomputer capable executing more than an exaflop, 10^18 floating point operations per second. On systems of this scale, failures will occur much more frequently than on current systems. As a result, resilience is a key obstacle to building next-generation extreme-scale systems. Coordinated checkpointing is currently the most widely-used mechanism for handling failures on HPC systems. Although coordinated checkpointing remains effective on current systems, increasing the scale of today\u27s systems to build next-generation systems will increase the cost of fault tolerance as more and more time is taken away from the application to protect against or recover from failure. Rollback avoidance techniques seek to mitigate the cost of checkpoint/restart by allowing an application to continue its execution rather than rolling back to an earlier checkpoint when failures occur. These techniques include failure prediction and preventive migration, replicated computation, fault-tolerant algorithms, and software-based memory fault correction. In this thesis, I examine how rollback avoidance techniques can be used to address failures on extreme-scale systems. Using a combination of analytic modeling and simulation, I evaluate the potential impact of rollback avoidance on these systems. I then present a novel rollback avoidance technique that exploits similarities in application memory. Finally, I examine the feasibility of using this technique to protect against memory faults in kernel memory

    A Concept of Operations for an Integrated Vehicle Health Assurance System

    Get PDF
    This document describes a Concept of Operations (ConOps) for an Integrated Vehicle Health Assurance System (IVHAS). This ConOps is associated with the Maintain Vehicle Safety (MVS) between Major Inspections Technical Challenge in the Vehicle Systems Safety Technologies (VSST) Project within NASA s Aviation Safety Program. In particular, this document seeks to describe an integrated system concept for vehicle health assurance that integrates ground-based inspection and repair information with in-flight measurement data for airframe, propulsion, and avionics subsystems. The MVS Technical Challenge intends to maintain vehicle safety between major inspections by developing and demonstrating new integrated health management and failure prevention technologies to assure the integrity of vehicle systems between major inspection intervals and maintain vehicle state awareness during flight. The approach provided by this ConOps is intended to help optimize technology selection and development, as well as allow the initial integration and demonstration of these subsystem technologies over the 5 year span of the VSST program, and serve as a guideline for developing IVHAS technologies under the Aviation Safety Program within the next 5 to 15 years. A long-term vision of IVHAS is provided to describe a basic roadmap for more intelligent and autonomous vehicle systems

    The Fifth Workshop on HPC Best Practices: File Systems and Archives

    Full text link
    The workshop on High Performance Computing (HPC) Best Practices on File Systems and Archives was the fifth in a series sponsored jointly by the Department Of Energy (DOE) Office of Science and DOE National Nuclear Security Administration. The workshop gathered technical and management experts for operations of HPC file systems and archives from around the world. Attendees identified and discussed best practices in use at their facilities, and documented findings for the DOE and HPC community in this report

    End-to-End Trust Fulfillment of Big Data Workflow Provisioning over Competing Clouds

    Get PDF
    Cloud Computing has emerged as a promising and powerful paradigm for delivering data- intensive, high performance computation, applications and services over the Internet. Cloud Computing has enabled the implementation and success of Big Data, a relatively recent phenomenon consisting of the generation and analysis of abundant data from various sources. Accordingly, to satisfy the growing demands of Big Data storage, processing, and analytics, a large market has emerged for Cloud Service Providers, offering a myriad of resources, platforms, and infrastructures. The proliferation of these services often makes it difficult for consumers to select the most suitable and trustworthy provider to fulfill the requirements of building complex workflows and applications in a relatively short time. In this thesis, we first propose a quality specification model to support dual pre- and post-cloud workflow provisioning, consisting of service provider selection and workflow quality enforcement and adaptation. This model captures key properties of the quality of work at different stages of the Big Data value chain, enabling standardized quality specification, monitoring, and adaptation. Subsequently, we propose a two-dimensional trust-enabled framework to facilitate end-to-end Quality of Service (QoS) enforcement that: 1) automates cloud service provider selection for Big Data workflow processing, and 2) maintains the required QoS levels of Big Data workflows during runtime through dynamic orchestration using multi-model architecture-driven workflow monitoring, prediction, and adaptation. The trust-based automatic service provider selection scheme we propose in this thesis is comprehensive and adaptive, as it relies on a dynamic trust model to evaluate the QoS of a cloud provider prior to taking any selection decisions. It is a multi-dimensional trust model for Big Data workflows over competing clouds that assesses the trustworthiness of cloud providers based on three trust levels: (1) presence of the most up-to-date cloud resource verified capabilities, (2) reputational evidence measured by neighboring users and (3) a recorded personal history of experiences with the cloud provider. The trust-based workflow orchestration scheme we propose aims to avoid performance degradation or cloud service interruption. Our workflow orchestration approach is not only based on automatic adaptation and reconfiguration supported by monitoring, but also on predicting cloud resource shortages, thus preventing performance degradation. We formalize the cloud resource orchestration process using a state machine that efficiently captures different dynamic properties of the cloud execution environment. In addition, we use a model checker to validate our monitoring model in terms of reachability, liveness, and safety properties. We evaluate both our automated service provider selection scheme and cloud workflow orchestration, monitoring and adaptation schemes on a workflow-enabled Big Data application. A set of scenarios were carefully chosen to evaluate the performance of the service provider selection, workflow monitoring and the adaptation schemes we have implemented. The results demonstrate that our service selection outperforms other selection strategies and ensures trustworthy service provider selection. The results of evaluating automated workflow orchestration further show that our model is self-adapting, self-configuring, reacts efficiently to changes and adapts accordingly while enforcing QoS of workflows

    Learning workload behaviour models from monitored time-series for resource estimation towards data center optimization

    Get PDF
    In recent years there has been an extraordinary growth of the demand of Cloud Computing resources executed in Data Centers. Modern Data Centers are complex systems that need management. As distributed computing systems grow, and workloads benefit from such computing environments, the management of such systems increases in complexity. The complexity of resource usage and power consumption on cloud-based applications makes the understanding of application behavior through expert examination difficult. The difficulty increases when applications are seen as "black boxes", where only external monitoring can be retrieved. Furthermore, given the different amount of scenarios and applications, automation is required. To deal with such complexity, Machine Learning methods become crucial to facilitate tasks that can be automatically learned from data. Firstly, this thesis proposes an unsupervised learning technique to learn high level representations from workload traces. Such technique provides a fast methodology to characterize workloads as sequences of abstract phases. The learned phase representation is validated on a variety of datasets and used in an auto-scaling task where we show that it can be applied in a production environment, achieving better performance than other state-of-the-art techniques. Secondly, this thesis proposes a neural architecture, based on Sequence-to-Sequence models, that provides the expected resource usage of applications sharing hardware resources. The proposed technique provides resource managers the ability to predict resource usage over time as well as the completion time of the running applications. The technique provides lower error predicting usage when compared with other popular Machine Learning methods. Thirdly, this thesis proposes a technique for auto-tuning Big Data workloads from the available tunable parameters. The proposed technique gathers information from the logs of an application generating a feature descriptor that captures relevant information from the application to be tuned. Using this information we demonstrate that performance models can generalize up to a 34% better when compared with other state-of-the-art solutions. Moreover, the search time to find a suitable solution can be drastically reduced, with up to a 12x speedup and almost equal quality results as modern solutions. These results prove that modern learning algorithms, with the right feature information, provide powerful techniques to manage resource allocation for applications running in cloud environments. This thesis demonstrates that learning algorithms allow relevant optimizations in Data Center environments, where applications are externally monitored and careful resource management is paramount to efficiently use computing resources. We propose to demonstrate this thesis in three areas that orbit around resource management in server environmentsEls Centres de Dades (Data Centers) moderns són sistemes complexos que necessiten ser gestionats. A mesura que creixen els sistemes de computació distribuïda i les aplicacions es beneficien d’aquestes infraestructures, també n’augmenta la seva complexitat. La complexitat que implica gestionar recursos de còmput i d’energia en sistemes de computació al núvol fa difícil entendre el comportament de les aplicacions que s'executen de manera manual. Aquesta dificultat s’incrementa quan les aplicacions s'observen com a "caixes negres", on només es poden monitoritzar algunes mètriques de les caixes de manera externa. A més, degut a la gran varietat d’escenaris i aplicacions, és necessari automatitzar la gestió d'aquests recursos. Per afrontar-ne el repte, l'aprenentatge automàtic juga un paper cabdal que facilita aquestes tasques, que poden ser apreses automàticament en base a dades prèvies del sistema que es monitoritza. Aquesta tesi demostra que els algorismes d'aprenentatge poden aportar optimitzacions molt rellevants en la gestió de Centres de Dades, on les aplicacions són monitoritzades externament i la gestió dels recursos és de vital importància per a fer un ús eficient de la capacitat de còmput d'aquests sistemes. En primer lloc, aquesta tesi proposa emprar aprenentatge no supervisat per tal d’aprendre representacions d'alt nivell a partir de traces d'aplicacions. Aquesta tècnica ens proporciona una metodologia ràpida per a caracteritzar aplicacions vistes com a seqüències de fases abstractes. La representació apresa de fases és validada en diferents “datasets” i s'aplica a la gestió de tasques d'”auto-scaling”, on es conclou que pot ser aplicable en un medi de producció, aconseguint un millor rendiment que altres mètodes de vanguardia. En segon lloc, aquesta tesi proposa l'ús de xarxes neuronals, basades en arquitectures “Sequence-to-Sequence”, que proporcionen una estimació dels recursos usats per aplicacions que comparteixen recursos de hardware. La tècnica proposada facilita als gestors de recursos l’habilitat de predir l'ús de recursos a través del temps, així com també una estimació del temps de còmput de les aplicacions. Tanmateix, redueix l’error en l’estimació de recursos en comparació amb d’altres tècniques populars d'aprenentatge automàtic. Per acabar, aquesta tesi introdueix una tècnica per a fer “auto-tuning” dels “hyper-paràmetres” d'aplicacions de Big Data. Consisteix així en obtenir informació dels “logs” de les aplicacions, generant un vector de característiques que captura informació rellevant de les aplicacions que s'han de “tunejar”. Emprant doncs aquesta informació es valida que els ”Regresors” entrenats en la predicció del rendiment de les aplicacions són capaços de generalitzar fins a un 34% millor que d’altres “Regresors” de vanguàrdia. A més, el temps de cerca per a trobar una bona solució es pot reduir dràsticament, aconseguint un increment de millora de fins a 12 vegades més dels resultats de qualitat en contraposició a alternatives modernes. Aquests resultats posen de manifest que els algorismes moderns d'aprenentatge automàtic esdevenen tècniques molt potents per tal de gestionar l'assignació de recursos en aplicacions que s'executen al núvol.Arquitectura de computador

    Management And Security Of Multi-Cloud Applications

    Get PDF
    Single cloud management platform technology has reached maturity and is quite successful in information technology applications. Enterprises and application service providers are increasingly adopting a multi-cloud strategy to reduce the risk of cloud service provider lock-in and cloud blackouts and, at the same time, get the benefits like competitive pricing, the flexibility of resource provisioning and better points of presence. Another class of applications that are getting cloud service providers increasingly interested in is the carriers\u27 virtualized network services. However, virtualized carrier services require high levels of availability and performance and impose stringent requirements on cloud services. They necessitate the use of multi-cloud management and innovative techniques for placement and performance management. We consider two classes of distributed applications – the virtual network services and the next generation of healthcare – that would benefit immensely from deployment over multiple clouds. This thesis deals with the design and development of new processes and algorithms to enable these classes of applications. We have evolved a method for optimization of multi-cloud platforms that will pave the way for obtaining optimized placement for both classes of services. The approach that we have followed for placement itself is predictive cost optimized latency controlled virtual resource placement for both types of applications. To improve the availability of virtual network services, we have made innovative use of the machine and deep learning for developing a framework for fault detection and localization. Finally, to secure patient data flowing through the wide expanse of sensors, cloud hierarchy, virtualized network, and visualization domain, we have evolved hierarchical autoencoder models for data in motion between the IoT domain and the multi-cloud domain and within the multi-cloud hierarchy

    Market driven elastic secure infrastructure

    Full text link
    In today’s Data Centers, a combination of factors leads to the static allocation of physical servers and switches into dedicated clusters such that it is difficult to add or remove hardware from these clusters for short periods of time. This silofication of the hardware leads to inefficient use of clusters. This dissertation proposes a novel architecture for improving the efficiency of clusters by enabling them to add or remove bare-metal servers for short periods of time. We demonstrate by implementing a working prototype of the architecture that such silos can be broken and it is possible to share servers between clusters that are managed by different tools, have different security requirements, and are operated by tenants of the Data Center, which may not trust each other. Physical servers and switches in a Data Center are grouped for a combination of reasons. They are used for different purposes (staging, production, research, etc); host applications required for servicing specific workloads (HPC, Cloud, Big Data, etc); and/or configured to meet stringent security and compliance requirements. Additionally, different provisioning systems and tools such as Openstack-Ironic, MaaS, Foreman, etc that are used to manage these clusters take control of the servers making it difficult to add or remove the hardware from their control. Moreover, these clusters are typically stood up with sufficient capacity to meet anticipated peak workload. This leads to inefficient usage of the clusters. They are under-utilized during off-peak hours and in the cases where the demand exceeds capacity the clusters suffer from degraded quality of service (QoS) or may violate service level objectives (SLOs). Although today’s clouds offer huge benefits in terms of on-demand elasticity, economies of scale, and a pay-as-you-go model yet many organizations are reluctant to move their workloads to the cloud. Organizations that (i) needs total control of their hardware (ii) has custom deployment practices (iii) needs to match stringent security and compliance requirements or (iv) do not want to pay high costs incurred from running workloads in the cloud prefers to own its hardware and host it in a data center. This includes a large section of the economy including financial companies, medical institutions, and government agencies that continue to host their own clusters outside of the public cloud. Considering that all the clusters may not undergo peak demand at the same time provides an opportunity to improve the efficiency of clusters by sharing resources between them. The dissertation describes the design and implementation of the Market Driven Elastic Secure Infrastructure (MESI) as an alternative to the public cloud and as an architecture for the lowest layer of the public cloud to improve its efficiency. It allows mutually non-trusting physically deployed services to share the physical servers of a data center efficiently. The approach proposed here is to build a system composed of a set of services each fulfilling a specific functionality. A tenant of the MESI has to trust only a minimal functionality of the tenant that offers the hardware resources. The rest of the services can be deployed by each tenant themselves MESI is based on the idea of enabling tenants to share hardware they own with tenants they may not trust and between clusters with different security requirements. The architecture provides control and freedom of choice to the tenants whether they wish to deploy and manage these services themselves or use them from a trusted third party. MESI services fit into three layers that build on each other to provide: 1) Elastic Infrastructure, 2) Elastic Secure Infrastructure, and 3) Market-driven Elastic Secure Infrastructure. 1) Hardware Isolation Layer (HIL) – the bottommost layer of MESI is designed for moving nodes between multiple tools and schedulers used for managing the clusters. It defines HIL to control the layer 2 switches and bare-metal servers such that tenants can elastically adjust the size of the clusters in response to the changing demand of the workload. It enables the movement of nodes between clusters with minimal to no modifications required to the tools and workflow used for managing these clusters. (2) Elastic Secure Infrastructure (ESI) builds on HIL to enable sharing of servers between clusters with different security requirements and mutually non-trusting tenants of the Data Center. ESI enables the borrowing tenant to minimize its trust in the node provider and take control of trade-offs between cost, performance, and security. This enables sharing of nodes between tenants that are not only part of the same organization by can be organization tenants in a co-located Data Center. (3) The Bare-metal Marketplace is an incentive-based system that uses economic principles of the marketplace to encourage the tenants to share their servers with others not just when they do not need them but also when others need them more. It provides tenants the ability to define their own cluster objectives and sharing constraints and the freedom to decide the number of nodes they wish to share with others. MESI is evaluated using prototype implementations at each layer of the architecture. (i) The HIL prototype implemented with only 3000 Lines of Code (LOC) is able to support many provisioning tools and schedulers with little to no modification; adds no overhead to the performance of the clusters and is in active production use at MOC managing over 150 servers and 11 switches. (ii) The ESI prototype builds on the HIL prototype and adds to it an attestation service, a provisioning service, and a deterministically built open-source firmware. Results demonstrate that it is possible to build a cluster that is secure, elastic, and fairly quick to set up. The tenant requires only minimum trust in the provider for the availability of the node. (iii) The MESI prototype demonstrates the feasibility of having a one-of-kind multi-provider marketplace for trading bare-metal servers where providers also use the nodes. The evaluation of the MESI prototype shows that all the clusters benefit from participating in the marketplace. It uses agents to trade bare-metal servers in a marketplace to meet the requirements of their clusters. Results show that compared to operating as silos individual clusters see a 50% improvement in the total work done; up to 75% improvement (reduction) in waiting for queues and up to 60% improvement in the aggregate utilization of the test bed. This dissertation makes the following contributions: (i) It defines the architecture of MESI allows mutually non-trusting tenants of the data center to share resources between clusters with different security requirements. (ii) Demonstrates that it is possible to design a service that breaks the silos of static allocation of clusters yet has a small Trusted Computing Base (TCB) and no overhead to the performance of the clusters. (iii) Provides a unique architecture that puts the tenant in control of its own security and minimizes the trust needed in the provider for sharing nodes. (iv) A working prototype of a multi-provider marketplace for bare-metal servers which is a first proof-of-concept that demonstrates that it is possible to trade real bare-metal nodes at practical time scales such that moving nodes between clusters is sufficiently fast to be able to get some useful work done. (v) Finally results show that it is possible to encourage even mutually non-trusting tenants to share their nodes with each other without any central authority making allocation decisions. Many smart, dedicated engineers and researchers have contributed to this work over the years. I have jointly led the efforts to design the HIL and the ESI layer; led the design and implementation of the bare-metal marketplace and the overall MESI architecture
    corecore