
    Khaos: Dynamically Optimizing Checkpointing for Dependable Distributed Stream Processing

    Distributed Stream Processing systems are becoming an increasingly essential part of Big Data processing platforms as users grow ever more reliant on their ability to provide fast access to new results. As such, making timely decisions based on these results depends on a system's ability to tolerate failure. Typically, these systems achieve fault tolerance and the ability to recover automatically from partial failures by implementing checkpoint and rollback recovery. However, owing to the statistical probability of partial failures occurring in these distributed environments and the variability of the workloads upon which jobs are expected to operate, static configurations will often not meet Quality of Service constraints with low overhead. In this paper we present Khaos, a new approach that utilizes the parallel processing capabilities of virtual cloud automation technologies for the automatic runtime optimization of fault tolerance configurations in Distributed Stream Processing jobs. Our approach employs three successive phases that borrow from the principles of Chaos Engineering: establish the steady-state processing conditions, conduct experiments to better understand how the system performs under failure, and use this knowledge to continuously minimize Quality of Service violations. We implemented a prototype of Khaos on top of Apache Flink and demonstrate its usefulness experimentally.
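
    The core tension Khaos targets, choosing how often to checkpoint given an observed failure rate, can be illustrated with the classic Young/Daly first-order rule. The sketch below is a generic rule of thumb under an exponential-failure assumption, not Khaos's runtime optimization, and the checkpoint cost and MTBF values are invented for illustration.

```python
import math

def young_daly_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """First-order (Young/Daly) approximation of the checkpoint interval
    that minimizes expected overhead: sqrt(2 * C * MTBF)."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)

# Illustrative numbers only: a 30 s checkpoint and a cluster-level MTBF of 6 hours.
interval = young_daly_interval(checkpoint_cost_s=30.0, mtbf_s=6 * 3600)
print(f"suggested checkpoint interval: {interval:.0f} s")   # ~1138 s
```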

    Delay prediction system for large-scale railway networks based on big data analytics

    State-of-the-art train delay prediction systems do not exploit the historical train movement data collected by railway information systems; instead, they rely on static rules built by experts of the railway infrastructure based on classical univariate statistics. The purpose of this paper is to build a data-driven train delay prediction system for large-scale railway networks which exploits the most recent Big Data technologies and learning algorithms. In particular, we propose a fast learning algorithm for predicting train delays based on the Extreme Learning Machine that fully exploits recent in-memory large-scale data processing technologies. Our system is able to rapidly extract nontrivial information from the large amount of data available in order to make accurate predictions about different future states of the railway network. Results on real-world data coming from the Italian railway network show that our proposal is able to improve on the current state-of-the-art train delay prediction systems.
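
    As a point of reference for the learning algorithm named above, the following is a minimal single-hidden-layer Extreme Learning Machine: the hidden weights are drawn at random and only the output weights are solved for in closed form. The features, sizes, and regularization constant are illustrative, not the paper's large-scale in-memory implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def elm_fit(X, y, n_hidden=100, reg=1e-3):
    """Extreme Learning Machine: random hidden layer, closed-form output weights."""
    W = rng.normal(size=(X.shape[1], n_hidden))   # random input weights (never trained)
    b = rng.normal(size=n_hidden)                 # random biases
    H = np.tanh(X @ W + b)                        # hidden-layer activations
    # Ridge-regularized least squares for the output weights.
    beta = np.linalg.solve(H.T @ H + reg * np.eye(n_hidden), H.T @ y)
    return W, b, beta

def elm_predict(X, W, b, beta):
    return np.tanh(X @ W + b) @ beta

# Toy usage: predict a delay (in minutes) from two synthetic features.
X_train = rng.normal(size=(500, 2))
y_train = 3.0 * X_train[:, 0] - 2.0 * X_train[:, 1] + rng.normal(scale=0.1, size=500)
W, b, beta = elm_fit(X_train, y_train)
print(elm_predict(X_train[:5], W, b, beta))
```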

    TRANSOM: An Efficient Fault-Tolerant System for Training LLMs

    Large language models (LLMs) with hundreds of billions or trillions of parameters, exemplified by ChatGPT, have had a profound impact on various fields. However, training LLMs with super-large-scale parameters requires large high-performance GPU clusters and long training periods lasting for months. Due to the inevitable hardware and software failures in large-scale clusters, maintaining uninterrupted, long-duration training is extremely challenging. As a result, a substantial amount of training time is devoted to task checkpoint saving and loading, task rescheduling and restart, and manual anomaly checks, which greatly harms overall training efficiency. To address these issues, we propose TRANSOM, a novel fault-tolerant LLM training system. In this work, we design three key subsystems: the training pipeline automatic fault tolerance and recovery mechanism named Transom Operator and Launcher (TOL), the training task multi-dimensional metric automatic anomaly detection system named Transom Eagle Eye (TEE), and the training checkpoint asynchronous access automatic fault tolerance and recovery technology named Transom Checkpoint Engine (TCE). Here, TOL manages the lifecycle of training tasks, while TEE is responsible for task monitoring and anomaly reporting. TEE detects training anomalies and reports them to TOL, which automatically applies the fault tolerance strategy to eliminate abnormal nodes and restart the training task. The asynchronous checkpoint saving and loading functionality provided by TCE greatly shortens the fault tolerance overhead. The experimental results indicate that TRANSOM significantly enhances the efficiency of large-scale LLM training on clusters. Specifically, the pre-training time for GPT3-175B has been reduced by 28%, while checkpoint saving and loading performance have improved by a factor of 20.
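
    The idea behind TCE's asynchronous checkpointing, keeping checkpoint I/O off the training critical path, can be sketched as follows. This is a simplified single-process illustration using a background thread, not TRANSOM's actual engine; the state dictionary and file paths are made up.

```python
import copy
import pickle
import threading

def async_checkpoint(state: dict, path: str) -> threading.Thread:
    """Snapshot the training state, then persist it off the critical path."""
    snapshot = copy.deepcopy(state)   # copy so later training steps don't mutate what is being written
    def _write():
        with open(path, "wb") as f:
            pickle.dump(snapshot, f)
    t = threading.Thread(target=_write, daemon=True)
    t.start()
    return t                          # the training loop continues while the write proceeds

# Toy usage inside a training loop.
state = {"step": 0, "weights": [0.0] * 1000}
for step in range(1, 6):
    state["step"] = step              # ... one training step would update the weights here ...
    if step % 2 == 0:
        async_checkpoint(state, f"/tmp/ckpt_{step}.pkl")
```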

    Methods for event time series prediction and anomaly detection

    Event time series are sequences of events occurring in continuous time. They arise in many real-world problems and may represent, for example, posts in social media, administrations of medications to patients, or adverse events, such as episodes of atrial fibrillation or earthquakes. In this work, we study and develop methods for prediction and anomaly detection on event time series. We study two general approaches. The first approach converts event time series to regular time series of counts via time discretization. We develop methods relying on (a) nonparametric time series decomposition and (b) dynamic linear models for regular time series. The second approach models the events in continuous time directly. We develop methods relying on point processes. For prediction, we develop a new model based on point processes that combines the advantages of existing models. It is flexible enough to capture complex dependency structures between events, while not sacrificing applicability in common scenarios. For anomaly detection, we develop methods that can detect new types of anomalies in continuous time and that show advantages compared to time discretization.
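
    A toy version of the first approach (discretizing events into counts and scoring each bin against a Poisson baseline) is sketched below. It is only meant to make the discretization idea concrete; it is not the dissertation's decomposition, dynamic linear model, or point-process methods, and the burst injected into the synthetic data is arbitrary.

```python
import numpy as np

def discretize(event_times, bin_width, t_max):
    """Convert event timestamps in continuous time into a regular series of counts."""
    edges = np.arange(0.0, t_max + bin_width, bin_width)
    counts, _ = np.histogram(event_times, bins=edges)
    return counts

def poisson_anomaly_scores(counts):
    """Score each bin by how surprising its count is under a homogeneous Poisson baseline."""
    lam = counts.mean()                          # baseline rate estimated from the whole series
    return (counts - lam) / np.sqrt(lam + 1e-9)  # standardized residuals; large values flag anomalies

rng = np.random.default_rng(1)
events = np.sort(rng.uniform(0, 100, size=300))                      # background events
events = np.concatenate([events, 60 + rng.uniform(0, 1, size=40)])   # injected burst near t = 60
counts = discretize(events, bin_width=1.0, t_max=100.0)
print(np.argmax(poisson_anomaly_scores(counts)))                     # index of the most anomalous bin
```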

    Reliability models for HPC applications and a Cloud economic model

    With the enormous number of computing resources in HPC and Cloud systems, failures become a major concern. Therefore, failure behaviors such as reliability, failure rate, and mean time to failure need to be understood to manage such large systems efficiently. This dissertation makes three major contributions to HPC and Cloud studies. First, a reliability model with correlated failures in a k-node system for HPC applications is studied. This model is extended to improve accuracy by accounting for failure correlation. The Marshall-Olkin Multivariate Weibull distribution is refined using the excess life (conditional Weibull) to better estimate system reliability. Also, a univariate method is proposed for estimating the Marshall-Olkin Multivariate Weibull parameters of a system composed of a large number of nodes. Then, the failure rate and mean time to failure are derived. The model is validated using log data from the Blue Gene/L system at LLNL. Results show that when failures of nodes in the system are correlated, the system becomes less reliable. Second, a reliability model of Cloud computing is proposed. The reliability model, mean time to failure, and failure rate are estimated based on a system of k nodes and s virtual machines under four scenarios: 1) hardware components fail independently, and software components fail independently; 2) software components fail independently, and hardware components are correlated in failure; 3) correlated software failure and independent hardware failure; and 4) dependent software and hardware failure. Results show that if the failures of the nodes and/or software in the system possess a degree of dependency, the system becomes less reliable. Also, an increase in the number of computing components decreases the reliability of the system. Finally, an economic model for a Cloud service provider is proposed. This economic model aims at maximizing profit based on the right pricing and rightsizing in the Cloud data center. Total cost is a key element in the model, and it is analyzed by considering the Total Cost of Ownership (TCO) of the Cloud.
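
    For reference, the Marshall-Olkin construction mentioned above can be written in a minimal two-node Weibull form as below; the notation (shock rates lambda_0, lambda_1, lambda_2 and shape beta) is generic textbook notation, not the dissertation's k-node parameterization.

```latex
% Common-shock (Marshall-Olkin) bivariate Weibull: node lifetimes
% X = min(Z_1, Z_0) and Y = min(Z_2, Z_0), where P(Z_i > t) = exp(-lambda_i t^beta)
% and Z_0 is a shock that takes down both nodes at once.
\begin{align*}
  \bar{F}(x, y) &= \Pr(X > x,\ Y > y)
    = \exp\!\left(-\lambda_1 x^{\beta} - \lambda_2 y^{\beta}
                  - \lambda_0 \max(x, y)^{\beta}\right), \\
  R(t) &= \bar{F}(t, t)
    = \exp\!\left(-(\lambda_1 + \lambda_2 + \lambda_0)\, t^{\beta}\right)
    \qquad \text{(two-node series system: either node's failure stops the job).}
\end{align*}
```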

    Hidden Markov Models

    Hidden Markov Models (HMMs), although known for decades, have become widely used in recent years and are still under active development. This book presents theoretical issues and a variety of HMM applications in speech recognition and synthesis, medicine, neurosciences, computational biology, bioinformatics, seismology, environment protection, and engineering. I hope that the reader will find this book useful and helpful for their own research.
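
    As a small concrete companion to the topic, the standard HMM forward algorithm can be sketched as follows; the two-state transition and emission matrices are invented purely for illustration and are not taken from the book.

```python
import numpy as np

def forward_likelihood(pi, A, B, obs):
    """HMM forward algorithm: probability of an observation sequence, summing over hidden paths."""
    alpha = pi * B[:, obs[0]]              # initialization with the first observation
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]      # propagate through transitions, weight by emission probs
    return alpha.sum()

pi = np.array([0.6, 0.4])                  # initial state distribution (illustrative)
A  = np.array([[0.7, 0.3],                 # state transition matrix
               [0.4, 0.6]])
B  = np.array([[0.9, 0.1],                 # emission probabilities (2 states x 2 symbols)
               [0.2, 0.8]])
print(forward_likelihood(pi, A, B, obs=[0, 1, 0]))
```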

    Multi-step Ahead Inflow Forecasting for a Norwegian Hydro-Power Use-Case, Based on Spatial-Temporal Attention Mechanism

    Hydrological forecasting has been an ongoing area of research due to its importance for improving decision making in water resource management, flood management, and climate change mitigation. With the increasing availability of hydrological data, Machine Learning (ML) techniques have started to play an important role, enabling us to better understand and predict complex hydrological events. However, some challenges remain. Hydrological processes have spatial and temporal dependencies that are not always easy to capture with traditional ML models, and a thorough understanding of these dependencies is essential when developing accurate predictive models. This thesis explores the use of ML techniques in hydrological forecasting and consists of an introduction, two papers, and an application developed alongside the case study. The motivation for this research is to enhance our understanding of the spatial and temporal dependencies in hydrological processes and to explore how ML techniques, particularly those incorporating attention mechanisms, can aid in hydrological forecasting. The first paper is a chronological literature review that explores the development of data-driven forecasting in hydrology and highlights the potential application of attention mechanisms in hydrological forecasting. These attention mechanisms have proven successful in various domains, allowing models to focus on the most relevant parts of the input when making predictions, which is particularly useful when dealing with spatial and temporal data. The second paper is a case study of a specific ML model incorporating these attention mechanisms. The focus is to illustrate the influence of spatial and temporal dependencies in a real-world hydrological forecasting scenario, thereby showcasing the practical application of these techniques. In parallel with the case study, an application has been developed, employing the principles and techniques discovered throughout the course of this research. The application aims to provide a practical demonstration of the concepts explored in the thesis, contributing to the field of hydrological forecasting by introducing a tool for hydropower suppliers.
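
    The attention mechanisms discussed above can be made concrete with a generic scaled dot-product self-attention over a window of time steps; this is a textbook building block, not the thesis's spatial-temporal model, and the feature shapes are invented.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Generic attention: weight every input position by query-key similarity, then mix values."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])           # similarity between queries and keys
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over the input positions
    return weights @ V, weights

rng = np.random.default_rng(2)
T, d = 24, 8                                          # e.g., 24 hourly steps, 8 features per step (toy)
X = rng.normal(size=(T, d))                           # encoded inflow/precipitation features (toy)
context, attn = scaled_dot_product_attention(X, X, X) # self-attention over the time axis
print(attn.shape)                                     # (24, 24): how much each step attends to every other step
```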

    Enhancing Grid Reliability With Phasor Measurement Units

    Over the last decades, great efforts and investments have been made to increase the integration level of renewable energy resources in power grids. New York State has set the goal of achieving 70% renewable generation by 2030 and realizing carbon neutrality by 2040. However, the increased level of uncertainty brought about by renewables makes it more challenging to maintain stable and robust power grid operation. In addition to renewable energy resources, the ever-increasing number of electric vehicles and active loads has further increased the uncertainties in power systems. All these factors challenge the way power grids are operated and thus call for new solutions to maintain stable and reliable grids. To meet the emerging requirements, advanced metering infrastructures are being integrated into power grids, transforming traditional grids into "smart grids". One example is the widely deployed phasor measurement units (PMUs), which enable generating time-synchronized measurements with high sampling frequency and pave a new path to realizing real-time monitoring and control in power grids. However, the massive data generated by PMUs raises the question of how to efficiently utilize the obtained measurements to understand and control the present system. Additionally, to meet the communication requirements between the advanced meters, the connectivity of the cyber layer has become more sophisticated and is thus exposed to more cyber-attacks than before. Therefore, to enhance grid reliability with PMUs, robust and efficient grid monitoring and control methods are required. This dissertation focuses on three important aspects of improving grid reliability with PMUs: (1) power system event detection; (2) impact assessment regarding both steady-state and transient stability; and (3) impact mitigation. In this dissertation, a comprehensive introduction to PMUs in the wide-area monitoring system and a comparison with existing supervisory control and data acquisition (SCADA) systems are presented first. Next, a data-driven event detection method is developed for efficient event detection with PMU measurements. A text mining approach is utilized to extract event oscillation patterns and determine event types. To ensure the integrity of the received data, the developed detection method is further designed to identify fake events and is thus robust against cyber-threats. Once a real event is detected, it is critical to promptly understand the consequences of the event in both steady and dynamic states. Sometimes, a single system event, e.g., a transmission line fault, may cause subsequent failures that lead to a cascading failure in the grid. In the worst case, these failures can result in large-scale blackouts. To assess the risk of an event in steady state, a probabilistic cascading failure model is developed. With the real-time phasor measurements, the failure probability of each system component at a specific operating condition can be predicted. In terms of the dynamic state, a failure of a system component may cause generators to lose synchronism, which will damage the power plant and lead to a blackout. To predict the transient stability after an event, a predictive online transient stability assessment (TSA) tool is developed in this dissertation. With only one sample of the PMU voltage measurements, the status of the transient stability can be predicted within cycles.
In addition to impact detection and assessment, it is also critical to identify proper mitigations to alleviate the failures. In this dissertation, a data-driven model predictive control strategy is developed. As a parameter-based system model is vulnerable to topology errors, a data-driven model is developed to mimic the grid behavior. Rather than utilizing the system parameters to construct the grid model, the data-driven model only leverages the received phasor measurements to determine proper corrective actions. Furthermore, to be robust against cyber-attacks, a check-point protocol, where past stored trustworthy data can be used to amend the attacked data, is utilized. The overall objective of this dissertation is to efficiently utilize advanced PMUs to detect, assess, and mitigate system failures, and to help improve grid reliability.
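
    As a loose illustration of the event detection step (not the dissertation's text-mining method), a PMU frequency stream can be screened for events with a simple rolling baseline test; the nominal frequency, window length, threshold, and injected dip below are all invented.

```python
import numpy as np

def rolling_event_flags(signal, window=30, threshold=4.0):
    """Flag samples that deviate unusually far from a trailing-window baseline."""
    flags = np.zeros(len(signal), dtype=bool)
    for i in range(window, len(signal)):
        hist = signal[i - window:i]
        mu, sigma = hist.mean(), hist.std() + 1e-9
        flags[i] = abs(signal[i] - mu) > threshold * sigma
    return flags

rng = np.random.default_rng(3)
freq = 60.0 + 0.002 * rng.normal(size=600)          # 60 Hz nominal frequency plus measurement noise
freq[400:420] -= 0.05                               # injected dip, e.g., a loss-of-generation event
print(np.where(rolling_event_flags(freq))[0][:5])   # first flagged sample indices (around 400)
```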

    Coping with recall and precision of soft error detectors

    Many methods are available to detect silent errors in high-performance computing (HPC) applications. Each method comes with a cost, a recall (the fraction of all errors that are actually detected; missed errors are false negatives), and a precision (the fraction of true errors among all detected errors; spurious detections are false positives). The main contribution of this paper is to characterize the optimal computing pattern for an application: which detector(s) to use, how many detectors of each type to use, together with the length of the work segment that precedes each of them. We first prove that detectors with imperfect precision offer limited usefulness. Then we focus on detectors with perfect precision, and we conduct a comprehensive complexity analysis of this optimization problem, showing NP-completeness and designing an FPTAS (Fully Polynomial-Time Approximation Scheme). On the practical side, we provide a greedy algorithm whose performance is shown to be close to optimal for a realistic set of evaluation scenarios. Extensive simulations illustrate the usefulness of detectors with false negatives, which are available at a lower cost than guaranteed detectors.
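
    The recall and precision definitions above, and the kind of cost trade-off the paper optimizes, can be illustrated with a naive comparison of two hypothetical detectors; the counts, costs, and the recall-per-second proxy are invented for illustration and are not the paper's FPTAS or greedy criterion.

```python
def recall(true_pos: int, false_neg: int) -> float:
    """Fraction of all real errors that the detector actually catches."""
    return true_pos / (true_pos + false_neg)

def precision(true_pos: int, false_pos: int) -> float:
    """Fraction of raised detections that correspond to real errors."""
    return true_pos / (true_pos + false_pos)

# Invented calibration counts for two hypothetical detectors.
cheap  = {"tp": 80, "fn": 20, "fp": 5, "cost_s": 1.0}   # partial detector, low cost
strict = {"tp": 99, "fn": 1,  "fp": 0, "cost_s": 20.0}  # guaranteed-style detector, high cost

for name, d in (("cheap", cheap), ("strict", strict)):
    r, p = recall(d["tp"], d["fn"]), precision(d["tp"], d["fp"])
    print(f"{name}: recall={r:.2f} precision={p:.2f} recall per second={r / d['cost_s']:.3f}")
```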