
    What broke where for distributed and parallel applications — a whodunit story

    Detection, diagnosis, and mitigation of performance problems in today's large-scale distributed and parallel systems is a difficult task. These systems are composed of many complex software and hardware components, and when a performance or correctness problem occurs, developers struggle to understand its root cause and fix it in a timely manner. In my thesis, I address these aspects of performance problems in computer systems. First, we focus on diagnosing performance problems in large-scale parallel applications running on supercomputers, and we developed techniques to localize a performance problem for root-cause analysis. Parallel applications, most of which are complex scientific simulations, can create up to millions of parallel tasks that run on different machines and communicate using the message-passing paradigm. We developed PRODOMETER, a highly scalable and accurate automated debugging tool, which first creates a logical progress-dependency graph of the tasks to show how the problem spread through the system and manifested as a system-wide performance issue, then uses this graph to identify the task where the problem originated, and finally pinpoints the code region corresponding to the origin of the bug. Second, we developed a tool-chain that detects performance anomalies using machine-learning techniques while achieving a very low false-positive rate. Our input-aware performance anomaly detection system consists of a scalable data-collection framework that gathers performance-related metrics from code regions at different granularities, an offline model-creation and prediction-error characterization technique, and a threshold-based anomaly-detection engine for production runs.
Our system requires few training runs and can handle unknown inputs and parameter combinations by dynamically calibrating the anomaly-detection threshold according to the characteristics of the input data and of the models' prediction error. Third, we developed a performance-problem mitigation scheme for erasure-coded distributed storage systems. Repairing failed blocks in an erasure-coded distributed storage system takes a long time in network-constrained data centers: during a repair operation, a large amount of data from multiple nodes is gathered at a single node, where a mathematical operation reconstructs the missing block. This process severely congests the links toward the destination that will host the newly recreated data. We proposed a novel distributed repair technique, called Partial-Parallel-Repair (PPR), that performs this reconstruction in parallel on multiple nodes, eliminating the network bottleneck and thereby greatly speeding up the repair process. Fourth, we study how, for a class of applications, performance can be improved (or performance problems mitigated) by selectively approximating some of the computations. For many applications, the main computation happens inside a loop that can be logically divided into a few temporal segments, which we call phases. We found that while approximating the initial phases might severely degrade the quality of the results, approximating the computation in later phases has very little impact on the final quality. Based on this observation, we developed an optimization framework that, for a given quality-loss budget, finds the best approximation settings for each phase of the execution.
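The intuition behind distributing the repair can be illustrated with a toy sketch. All names and numbers below are hypothetical, and a simple XOR code stands in for a full erasure code: instead of shipping all k surviving blocks over the links into one destination, helpers combine partial results pairwise along a reduction tree, so the busiest link carries only about log2(k) block transfers instead of k.

```python
from functools import reduce

def xor_blocks(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def centralized_repair(blocks):
    # All k surviving blocks converge on one node: the destination's
    # incoming link carries k block transfers.
    lost = reduce(xor_blocks, blocks)
    return lost, len(blocks)

def ppr_repair(blocks):
    # Pairwise reduction tree: each round halves the number of partial
    # results, and any node receives at most one block per round, so the
    # hottest link carries ~ceil(log2(k)) transfers.
    rounds = 0
    partials = list(blocks)
    while len(partials) > 1:
        nxt = [xor_blocks(partials[i], partials[i + 1])
               for i in range(0, len(partials) - 1, 2)]
        if len(partials) % 2:
            nxt.append(partials[-1])
        partials = nxt
        rounds += 1
    return partials[0], rounds

blocks = [bytes([i] * 4) for i in range(1, 9)]   # 8 surviving blocks
lost_c, links_c = centralized_repair(blocks)
lost_p, links_p = ppr_repair(blocks)
assert lost_c == lost_p           # both schemes reconstruct the same block
print(links_c, links_p)           # 8 vs 3 transfers on the busiest link
```

The reconstructed block is identical either way (XOR is associative and commutative); only the traffic pattern changes, which is the essence of the bottleneck elimination described above.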

    OPTIMAL REQUIREMENT DETERMINATION FOR PRICING AVAILABILITY-BASED SUSTAINMENT CONTRACTS

    Sustainment constitutes 70% or more of the total life-cycle cost of many safety-, mission-, and infrastructure-critical systems. Prediction and control of the life-cycle cost is an essential part of all sustainment contracts, and for many types of systems, availability is the most critical factor in determining the total life-cycle cost. To address this, availability-based contracts have been introduced into the governmental and non-governmental acquisition space (e.g., energy, defense, transportation, and healthcare). However, the development, implementation, and impact of availability requirements within contracts are not well understood. This dissertation develops a decision-support model for availability-based contract design based on contract theory, formal modeling, and stochastic optimization. By adopting and extending the “availability payment” concept introduced for civil-infrastructure Public-Private Partnerships (PPPs) and the pricing of Performance-Based Logistics (PBL) contracts, it develops requirements that maximize the outcome of contracts for both parties. Under the civil-infrastructure “availability payment” PPP, once the asset is available for use, the private sector receives a periodic payment for the contracted number of years, contingent on meeting performance requirements. This approach has been applied to highways, bridges, and similar assets. The challenge is to determine the most effective requirements, metrics, and payment model that protects the public interest (i.e., does not overpay the private sector) while also minimizing the risk that the asset will become unsupported. This dissertation focuses on availability as the key required outcome for mission-critical systems, provides a methodology for finding the optimum requirements and payment parameters, and introduces new metrics into availability-based contract structures.
In a product-service-oriented environment, formal modeling of contracts (for both the customer and the contractor) is necessary for pricing, negotiation, and transparency. Conventional methods for simulating a system through its life cycle do not capture the relationship between the contractor and the customer. This dissertation integrates engineering models with the incentive structure using a game-theoretic simulation, affine controller design, and stochastic optimization. The model has been used to explore the optimum availability assessment window (i.e., the length of time over which availability must be assessed) for an availability-based contract.
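Why the assessment-window length matters can be shown with a minimal sketch, using hypothetical numbers and a much simpler payment rule than the dissertation's model: the same downtime record can satisfy an availability requirement when assessed over one long window, yet forfeit payments when assessed over many short windows, because a single outage is diluted in the former but concentrated in one window in the latter.

```python
def window_availability(uptime, window):
    # Split the daily up/down record into assessment windows and compute
    # the fraction of days the system was up within each window.
    return [sum(uptime[i:i + window]) / window
            for i in range(0, len(uptime) - window + 1, window)]

def contract_payment(uptime, window, requirement, payment_per_window):
    # Toy payment rule: full payment for each window meeting the
    # requirement; a window that misses it forfeits its payment.
    avails = window_availability(uptime, window)
    return sum(payment_per_window if a >= requirement else 0 for a in avails)

# 360 days of operation with one 18-day outage (95% overall availability).
uptime = [1] * 360
for day in range(100, 118):
    uptime[day] = 0

# Same system, same 90% requirement, different assessment windows:
long_window = contract_payment(uptime, window=360, requirement=0.90,
                               payment_per_window=360)
short_window = contract_payment(uptime, window=30, requirement=0.90,
                                payment_per_window=30)
print(long_window, short_window)   # 360 vs 330: one 30-day window forfeited
```

Under the single 360-day window the contractor is paid in full (95% ≥ 90%), while under 30-day windows the window containing the outage drops to 40% availability and is forfeited, illustrating the risk-allocation trade-off the assessment window controls.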

    Edge Computing for Internet of Things

    The Internet of Things is becoming an established technology, with devices being deployed in homes, workplaces, and public areas at an increasingly rapid rate. IoT devices are the core technology of smart homes, smart cities, and intelligent transport systems, and they promise to optimise travel, reduce energy usage, and improve quality of life. As IoT becomes more prevalent, the problem of managing the vast volume, wide variety, and erratic generation patterns of IoT data grows increasingly challenging. This Special Issue focuses on solving this problem through edge computing, which processes IoT data close to the location where it is generated. Edge computing allows computation to be performed locally, reducing the volume of data that must be transmitted to remote data centres and Cloud storage, and allows decisions to be made locally without waiting for Cloud servers to respond.
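The data-reduction argument can be made concrete with a small sketch (the sensor format, threshold, and field names are hypothetical): an edge node summarises a batch of raw readings locally and forwards only the compact aggregate to the Cloud, while readings needing immediate action are handled on the spot.

```python
import statistics

def edge_summarise(readings, alert_threshold):
    # Process raw sensor readings at the edge: keep only a compact
    # summary for the Cloud, plus any readings that demand a local action.
    summary = {
        "count": len(readings),
        "mean": statistics.fmean(readings),
        "max": max(readings),
    }
    alerts = [r for r in readings if r > alert_threshold]  # acted on locally
    return summary, alerts

# One minute of 10 Hz temperature samples from a single device.
raw = [20.0 + 0.01 * i for i in range(600)]
summary, alerts = edge_summarise(raw, alert_threshold=25.98)

# Instead of 600 samples, only a 3-field summary crosses the network,
# and the single over-threshold reading was decided on without the Cloud.
print(summary["count"], len(alerts))
```

The same pattern scales from per-device batches to per-site aggregation: each hop toward the Cloud forwards less data and keeps latency-sensitive decisions local.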

    Volume II: Mining Innovation

    Contemporary exploitation of natural raw materials from borehole, opencast, underground, seabed, and anthropogenic deposits is closely related to, among other fields, geomechanics, automation, computer science, and numerical methods. Increasingly, individual fields of science coexist and complement each other, contributing to lower exploitation costs, increased production, and a shorter time needed to prepare and exploit a deposit. The continuous development of national economies brings increasing demand for energy, metal, rock, and chemical resources. Very often, exploitation is carried out in complex geological and mining conditions accompanied by natural hazards such as rock bursts, methane and coal-dust explosions, spontaneous combustion, water and gas inflows, and high temperatures. To conduct safe and economically justified operations, modern construction materials are increasingly used in mining to support excavations under both static and dynamic loads. The individual production stages are supported by specialized computer programs for cutting the deposit and for modeling the behavior of the rock mass after excavation. Automation and monitoring of mining works now play a very important role and significantly contribute to improved safety conditions. In this Special Issue of Energies, we focus on innovative laboratory, numerical, and industrial research that has a positive impact on the development of safety and exploitation in mining.

    MEMS Accelerometers

    Micro-electro-mechanical system (MEMS) devices are widely used for inertial, pressure, and ultrasound sensing applications. Research on integrated MEMS technology has undergone extensive development, driven by the requirements of a compact footprint, low cost, and increased functionality. Accelerometers are among the most widely used sensors implemented in MEMS technology, with a growing presence in almost all industries, from automotive to medical. A traditional MEMS accelerometer employs a proof mass suspended by springs, which displaces in response to an external acceleration. A single proof mass can be used for one- or multi-axis sensing. A variety of transduction mechanisms have been used to detect the displacement, including capacitive, piezoelectric, thermal, tunneling, and optical mechanisms. Capacitive accelerometers are widely used due to their DC measurement interface, thermal stability, reliability, and low cost. However, they are sensitive to electromagnetic interference and perform poorly in high-end applications (e.g., precise attitude control for satellites). Over the past three decades, steady progress has been made on optical accelerometers for high-performance and high-sensitivity applications, but several challenges, such as chip-scale integration, scaling, and low bandwidth, remain to be tackled before opto-mechanical accelerometers are fully realized.
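The spring-suspended proof mass described above follows Hooke's law: an external acceleration a displaces the mass by x = m·a/k, and a capacitive readout sees that displacement as a change in electrode gap. A small sketch with purely illustrative (not device-specific) parameters:

```python
# Parallel-plate capacitive MEMS accelerometer, illustrative numbers only.
EPS0 = 8.854e-12      # vacuum permittivity, F/m

def proof_mass_displacement(a, m, k):
    # Static balance: spring force k*x equals inertial force m*a.
    return m * a / k

def capacitance(gap, area):
    # Parallel-plate approximation, ignoring fringing fields.
    return EPS0 * area / gap

m = 1e-9          # proof mass, kg (hypothetical)
k = 1.0           # spring stiffness, N/m (hypothetical)
gap0 = 2e-6       # nominal electrode gap, m
area = 1e-6       # electrode area, m^2

a = 9.81          # 1 g of external acceleration
x = proof_mass_displacement(a, m, k)              # ~9.8 nm of displacement
dC = capacitance(gap0 - x, area) - capacitance(gap0, area)
print(f"displacement {x:.2e} m, capacitance change {dC:.2e} F")
```

The femtofarad-scale change this yields is why capacitive readouts pair the sense capacitor with a low-noise DC interface circuit, as noted above.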

    An Embryonics Inspired Architecture for Resilient Decentralised Cloud Service Delivery

    Data-driven artificial intelligence applications arising from Internet of Things technologies can have profound, wide-reaching societal benefits at the intersection of the cyber and physical domains, and their use cases are expanding rapidly. For example, smart homes and smart buildings provide intelligent monitoring, resource optimisation, safety, and security for their inhabitants; smart cities can manage transport, waste, energy, and crime at large scale; and smart manufacturing can autonomously produce goods through the self-management of factories and logistics. As these use cases expand further, the requirement to process data accurately and in a timely fashion becomes ever more crucial, since many of these applications are safety critical: loss of life and economic damage are likely possibilities in the event of system failure. While the typical service delivery paradigm, cloud computing, is strong because it operates on economies of scale, its physical distance from these applications introduces network latency that is incompatible with safety-critical applications. To complicate matters further, the environments these applications operate in are becoming increasingly hostile, with resource-constrained and mobile wireless networking commonplace. These issues drive the need for new service delivery architectures which operate closer to, or even upon, the network devices, sensors, and actuators which compose these IoT applications at the network edge. Such hostile and resource-constrained environments require adapting traditional cloud service delivery models to decentralised mobile and wireless settings. The resulting architectures need to provide persistent service delivery in the face of a variety of internal and external changes: resilient decentralised cloud service delivery.
While the current state of the art proposes numerous techniques to enhance the resilience of services in this manner, none provides an architecture for cloud-style data processing services that is inherently resilient. Adopting techniques from autonomic computing, whose characteristics are resilient by nature, this thesis presents a biologically inspired platform modelled on embryonics. Embryonic systems can self-heal and self-organise while supporting decentralised data processing. An initial model for embryonics-inspired resilient decentralised cloud service delivery is derived from the decentralised-cloud and resilience requirements given for this work. Next, this model is simulated using cellular automata, which illustrate the embryonic concept’s ability to provide self-healing service delivery under varying degrees of system component loss. The simulations highlight several tunable optimisation parameters, including application complexity bounds, differentiation optimisation, self-healing aggression, and system starting conditions, all of which can be adjusted to vary the resilience of the system according to resource capabilities and environmental hostility. Next, a proof-of-concept implementation is developed and validated, illustrating the efficacy of the solution. This proof-of-concept is evaluated at a larger scale, where batches of tests highlighted the system's performance criteria and constraints. One key finding was the considerable quantity of redundant messages produced under successful scenarios; these messages helped enable resilience yet could increase network contention, so balancing these attributes is important for each use case.
Finally, graph-based resilience algorithms were executed across all tests to determine the structural resilience of the system and whether structural metrics could measure or predict the application's resilience. Interestingly, this study showed that although the system was not considered structurally resilient, applications continued to execute in the face of many component failures, demonstrating that the autonomic embryonic functionality succeeded in executing applications resiliently and that structural and application resilience do not necessarily coincide. Additionally, one graph metric, assortativity, was found to be predictive of application resilience, although not of structural resilience.
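The embryonic self-healing idea can be sketched in miniature (with hypothetical roles and rules far simpler than the thesis's model): every cell carries the full "genome" (the complete application), expresses one role, and when a cell fails, an idle spare re-differentiates so the set of expressed roles is restored.

```python
GENOME = ["ingest", "process", "store"]   # every cell carries all roles

class Cell:
    def __init__(self, idx):
        self.alive = True
        # Cells beyond the genome length start as undifferentiated spares.
        self.role = GENOME[idx] if idx < len(GENOME) else None

def expressed(cells):
    # The set of roles currently provided by living, differentiated cells.
    return {c.role for c in cells if c.alive and c.role}

def self_heal(cells):
    # Embryonic re-differentiation: each missing role is re-expressed
    # by an idle spare cell, restoring the application's functionality.
    missing = set(GENOME) - expressed(cells)
    spares = (c for c in cells if c.alive and c.role is None)
    for role, spare in zip(sorted(missing), spares):
        spare.role = role

cells = [Cell(i) for i in range(5)]    # 3 working roles + 2 spares
cells[1].alive = False                 # the "process" cell fails
assert expressed(cells) == {"ingest", "store"}
self_heal(cells)
assert expressed(cells) == {"ingest", "process", "store"}  # restored
```

Service delivery survives as long as spares remain, which mirrors the simulated finding above that applications kept executing despite continued component loss.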

    Self-repair during continuous motion with modular robots

    Through the use of multiple modules that can reconfigure to form different morphologies, modular robots provide a potential route to more adaptable and resilient robots. Robots operating in challenging and hard-to-reach environments, such as infrastructure inspection, post-disaster search-and-rescue under rubble, and planetary surface exploration, could benefit from the capabilities modularity offers, especially the inherent fault tolerance which reconfigurability can provide. With self-reconfigurable modular robots, self-repair (removing failed modules from a larger structure and replacing them with operational modules) allows the functionality of the multi-robot organism as a whole to be recovered when modules are damaged. Previous self-repair work has paused, for the duration of the self-repair procedure, any group task in which the multi-robot organism was engaged; this thesis investigates self-repair during continuous motion, "Dynamic Self-repair", as a way to let repair and group tasks proceed concurrently. In this thesis a new modular robotic platform, Omni-Pi-tent, with capabilities for Dynamic Self-repair, is developed. This platform provides a unique combination of genderless docking, omnidirectional locomotion, 3D reconfiguration, and onboard sensing and autonomy. The platform is used in a series of simulated experiments to compare the performance of newly developed dynamic strategies for self-repair and self-assembly against adaptations of previous work, and in hardware demonstrations to explore their practical feasibility. Novel data structures for defining modular robotic structures, and the algorithms that process them for self-repair, are explained. It is concluded that self-repair during continuous motion can allow modular robots to complete tasks faster, and more effectively, than self-repair strategies which require collective tasks to be halted.
The hardware and strategies developed in this thesis should provide valuable lessons for bringing modular robots closer to real-world applications.

    Operational Research: Methods and Applications

    Throughout its history, Operational Research has evolved to include a variety of methods, models, and algorithms that have been applied to a diverse and wide range of contexts. This encyclopedic article consists of two main sections: methods and applications. The first summarises up-to-date knowledge and provides an overview of the state-of-the-art methods and key developments in the various subdomains of the field. The second offers a wide-ranging list of areas where Operational Research has been applied. The article is meant to be read in a nonlinear fashion and used as a point of reference, or first port of call, for a diverse pool of readers: academics, researchers, students, and practitioners. The entries within the methods and applications sections are presented in alphabetical order. The authors dedicate this paper to the victims of the 2023 Turkey/Syria earthquake, and sincerely hope that advances in OR will play a role in minimising the pain and suffering caused by this and future catastrophes.