
    Towards Continually Learning Application Performance Models

    Machine learning-based performance models are increasingly being used to make critical job scheduling and application optimization decisions. Traditionally, these models assume that the data distribution does not change as more samples are collected over time. However, owing to the complexity and heterogeneity of production HPC systems, they are susceptible to hardware degradation, replacement, and/or software patches, which can cause drift in the data distribution and adversely affect the performance models. To this end, we develop continually learning performance models that account for distribution drift, alleviate catastrophic forgetting, and improve generalizability. Our best model retained accuracy despite having to learn the new data distribution induced by system changes, while achieving a 2x improvement in prediction accuracy over the whole data sequence compared to the naive approach.
    Comment: Presented at the Workshop on Machine Learning for Systems at the 36th Conference on Neural Information Processing Systems (NeurIPS 2022).
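    A minimal sketch of the general idea, assuming an experience-replay approach to mitigating catastrophic forgetting; this is not the paper's actual method, and the class name, the MLP regressor, and the buffer size are illustrative assumptions.

        # Sketch only: retrain a runtime-prediction model on each new batch of
        # samples mixed with a bounded, randomly retained sample of older data,
        # so that adapting to a drifted distribution does not erase earlier ones.
        import random
        from sklearn.neural_network import MLPRegressor

        class ReplayPerformanceModel:
            def __init__(self, buffer_size=1000):
                self.model = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=500)
                self.buffer = []        # (features, runtime) pairs from past batches
                self.buffer_size = buffer_size
                self.seen = 0           # total samples observed so far

            def update(self, X_new, y_new):
                # Fit on the new batch together with the replayed old samples.
                X = list(X_new) + [x for x, _ in self.buffer]
                y = list(y_new) + [t for _, t in self.buffer]
                self.model.fit(X, y)
                # Reservoir sampling keeps a uniform sample of everything seen so far.
                for pair in zip(X_new, y_new):
                    self.seen += 1
                    if len(self.buffer) < self.buffer_size:
                        self.buffer.append(pair)
                    else:
                        j = random.randrange(self.seen)
                        if j < self.buffer_size:
                            self.buffer[j] = pair

            def predict(self, X):
                return self.model.predict(X)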

    Verifying File System Properties with Type Inference

    The storage stack is not trustworthy due to errors that arise from a variety of sources: unreliable hardware, malicious errors, and file system bugs. Today, software errors play a dominant role, owing to the inherent complexity of the software. In the first part of our project, we look at verifying a specific file system property: correct manipulation of on-disk pointers. We use CQUAL, a framework for adding type qualifiers with type inference support, and apply our analysis to the Linux ext2 file system. We find that adding qualifiers serves the valuable purpose of ensuring that on-disk pointers are accessed and manipulated correctly by the file system; thus, we believe the qualifiers we introduce would decrease the probability of file system programmers introducing bugs. We also describe our experience in using CQUAL and discuss its limitations. Based on that experience, we develop a second analysis, a buffer management verifier, that better fits the power of CQUAL by being simpler, yet more widely applicable to different file systems than the first analysis.
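    The analysis itself runs on C source through CQUAL's type inference; purely as a toy illustration, the sketch below shows the flavor of propagating qualifier constraints to a fixpoint and flagging a misuse of an on-disk pointer. The qualifier names, variables, and constraints are made-up assumptions, not taken from the paper.

        # Toy qualifier propagation (not CQUAL): each constraint (src, dst)
        # models an assignment dst = src, which lets src's qualifier flow to dst.
        declared = {"inode.block_ptr": "ondisk"}   # seed qualifier from an annotation
        constraints = [
            ("inode.block_ptr", "p"),              # p = inode->block_ptr
            ("p", "q"),                            # q = p
        ]
        deref_sites = {"q"}                        # dereferenced as an in-memory pointer

        def infer(constraints, declared):
            quals = dict(declared)
            changed = True
            while changed:                         # iterate to a fixpoint
                changed = False
                for src, dst in constraints:
                    if src in quals and dst not in quals:
                        quals[dst] = quals[src]
                        changed = True
            return quals

        quals = infer(constraints, declared)
        for site in deref_sites:
            if quals.get(site) == "ondisk":
                print(f"warning: {site} carries an on-disk pointer but is dereferenced as in-memory")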

    Impact of Limpware on HDFS: A Probabilistic Estimation

    With the advent of cloud computing, thousands of machines are connected and managed collectively. This era is confronted with a new challenge: performance variability, primarily caused by large-scale management issues such as hardware failures, software bugs, and configuration mistakes. In our previous work [2], we highlighted one overlooked cause: limping hardware, i.e., hardware whose performance degrades significantly compared to its specification. We showed that limping hardware can cause many limping scenarios in current scale-out systems. In this report, we quantify how often these scenarios occur in the Hadoop Distributed File System (HDFS).
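    A back-of-the-envelope sketch of the kind of estimate involved, assuming independently limping datanodes and HDFS's default three-replica write pipeline; the per-node limping probabilities below are illustrative assumptions, not figures from the report.

        # If each datanode limps independently with probability p, a write
        # pipeline of r replicas contains at least one limping node with
        # probability 1 - (1 - p)**r.
        def p_limping_pipeline(p_node: float, replicas: int = 3) -> float:
            return 1.0 - (1.0 - p_node) ** replicas

        for p in (0.001, 0.01, 0.05):
            print(f"p(node limps)={p:.3f} -> p(pipeline affected)={p_limping_pipeline(p):.4f}")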