438 research outputs found

    Intelligent Resource Prediction for HPC and Scientific Workflows

    Scientific workflows and high-performance computing (HPC) platforms are critically important to modern scientific research. In order to perform scientific experiments at scale, domain scientists must have knowledge and expertise in software and hardware systems that are highly complex and rapidly evolving. While computational expertise will be essential for domain scientists going forward, any tools or practices that reduce this burden for domain scientists will greatly increase the rate of scientific discoveries. One challenge that exists for domain scientists today is knowing the resource usage patterns of an application for the purpose of resource provisioning. A tool that accurately estimates these resource requirements would benefit HPC users in many ways, by reducing job failures and queue times on traditional HPC platforms and reducing costs on cloud computing platforms. To that end, we present Tesseract, a semi-automated tool that predicts resource usage for any application on any computing platform, from historical data, with minimal input from the user. We employ Tesseract to predict runtime, memory usage, and disk usage for a diverse set of scientific workflows, and in particular we show how these resource estimates can prevent under-provisioning. Finally, we leverage this core prediction capability to develop solutions for the related challenges of anomaly detection, cross-platform runtime prediction, and cost prediction.

    Using Workload Prediction and Federation to Increase Cloud Utilization

    The widespread adoption of cloud computing has changed how large-scale computing infrastructure is built and managed. Infrastructure-as-a-Service (IaaS) clouds consolidate different separate workloads onto a shared platform and provide a consistent quality of service by overprovisioning capacity. This additional capacity, however, remains idle for extended periods of time and represents a drag on system efficiency. The smaller scale of private IaaS clouds compared to public clouds exacerbates overprovisioning inefficiencies as opportunities for workload consolidation in private clouds are limited. Federation and cycle harvesting capabilities from computational grids help to improve efficiency, but to date have seen only limited adoption in the cloud due to a fundamental mismatch between the usage models of grids and clouds. Computational grids provide high throughput of queued batch jobs on a best-effort basis and enforce user priorities through dynamic job preemption, while IaaS clouds provide immediate feedback to user requests and make ahead-of-time guarantees about resource availability. We present a novel method to enable workload federation across IaaS clouds that overcomes this mismatch between grid and cloud usage models and improves system efficiency while also offering availability guarantees. We develop a new method for faster-than-realtime simulation of IaaS clouds to make predictions about system utilization and leverage this method to estimate the future availability of preemptible resources in the cloud. We then use these estimates to perform careful admission control and provide ahead-of-time bounds on the preemption probability of federated jobs executing on preemptible resources. Finally, we build an end-to-end prototype that addresses practical issues of workload federation and evaluate the prototype's efficacy using real-world traces from big data and compute-intensive production workloads.

    Novel Representation Learning Technique using Graphs for Performance Analytics

    The performance analytics domain in High Performance Computing (HPC) uses tabular data to solve regression problems, such as predicting the execution time. Existing Machine Learning (ML) techniques leverage the correlations among features in tabular datasets but do not leverage the relationships between samples directly. Moreover, since high-quality embeddings from raw features improve the fidelity of the downstream predictive models, existing methods rely on extensive feature engineering and pre-processing steps, costing time and manual effort. To fill these two gaps, we propose a novel idea of transforming tabular performance data into graphs to leverage the advancement of Graph Neural Network-based (GNN) techniques in capturing complex relationships between features and samples. In contrast to other ML application domains, such as social networks, the graph is not given; instead, we need to build it. To address this gap, we propose graph-building methods where nodes represent samples, and the edges are automatically inferred iteratively based on the similarity between the features in the samples. We evaluate the effectiveness of the generated embeddings from GNNs based on how well they make even a simple feed-forward neural network perform for regression tasks compared to other state-of-the-art representation learning techniques. Our evaluation demonstrates that even with up to 25% random missing values for each dataset, our method outperforms commonly used graph and Deep Neural Network (DNN)-based approaches and achieves up to 61.67% and 78.56% improvement in MSE loss over the DNN baseline for the HPC dataset and the machine learning datasets, respectively. Comment: This paper has been accepted at the 22nd International Conference on Machine Learning and Applications (ICMLA2023).
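The graph-building idea in the abstract above (nodes are samples, edges follow feature similarity) can be illustrated with a minimal sketch. This is not the authors' method: the abstract describes an iterative inference procedure that is not specified here, so the sketch assumes a single-pass k-nearest-neighbour graph over cosine similarity. The names `cosine_similarity`, `build_knn_graph`, and the parameter `k` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero row as similar to nothing
    return dot / (norm_a * norm_b)


def build_knn_graph(samples, k=2):
    """Return sorted undirected edges (i, j), i < j, over sample indices.

    Assumption: each sample is linked to its k most similar samples.
    This one-shot rule stands in for the iterative edge inference
    mentioned in the abstract, which is not reproduced here.
    """
    edges = set()
    for i, row in enumerate(samples):
        scored = [
            (cosine_similarity(row, other), j)
            for j, other in enumerate(samples)
            if j != i
        ]
        # Highest similarity first; ties broken by lower index.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        for _, j in scored[:k]:
            edges.add((min(i, j), max(i, j)))
    return sorted(edges)


# Toy tabular data: four samples with two features each.
samples = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
edges = build_knn_graph(samples, k=1)
```

The resulting edge list could then be handed to a GNN library as the adjacency structure; which library and which GNN architecture the paper uses is not stated in the abstract.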

    Load Balancing with Job-Size Testing: Performance Improvement or Degradation?

    In the context of decision making under explorable uncertainty, scheduling with testing is a powerful technique used in the management of computer systems to improve performance via better job-dispatching decisions. Upon job arrival, a scheduler may run some testing algorithm against the job to extract some information about its structure, e.g., its size, and properly classify it. The acquisition of such knowledge comes with a cost because the testing algorithm delays the dispatching decisions, though this is under control. In this paper, we analyze the impact of such extra cost in a load balancing setting by investigating the following questions: does it really pay off to test jobs? If so, under which conditions? Under mild assumptions relating the information extracted by the testing algorithm to its running time, we show that whether scheduling with testing brings a performance degradation or improvement strongly depends on the traffic conditions, system size and the coefficient of variation of job sizes. Thus, the general answer to the above questions is non-trivial and some care should be taken when deploying a testing policy. Our results are achieved by proposing a load balancing model for scheduling with testing that we analyze in two limiting regimes. When the number of servers grows to infinity in proportion to the network demand, we show that job-size testing actually degrades performance unless short jobs can be predicted reliably almost instantaneously and the network load is sufficiently high. When the coefficient of variation of job sizes grows to infinity, we construct testing policies inducing an arbitrarily large performance gain with respect to running jobs untested.

    Big data and hydroinformatics

    Big data is popular in the areas of computer science, commerce and bioinformatics, but is at an early stage in hydroinformatics. Big data originates from extremely large datasets that cannot be processed in a tolerable elapsed time with traditional data processing methods. Using an analogy from object-oriented programming, big data should be considered as objects encompassing the data, its characteristics and the processing methods. Hydroinformatics can benefit from big data technology with newly emerged data, techniques and analytical tools to handle large datasets, from which creative ideas and new values could be mined. This paper provides a timely review of big data and its relevance to hydroinformatics. Precipitation big data is explored further because estimation of precipitation is an important part of hydrology for managing floods and droughts, and understanding the global water cycle. It is promising that fusion of precipitation data from remote sensing, weather radar, rain gauge and numerical weather modelling could be achieved by parallel computing and distributed data storage, which will trigger a leap in precipitation estimation as the available data from multiple sources could be fused to generate a better product than those from single sources.

    Constraint Programming-based Job Dispatching for Modern HPC Applications

    A High-Performance Computing job dispatcher is a critical piece of software that assigns the finite computing resources to submitted jobs. This resource assignment over time is known as the on-line job dispatching problem in HPC systems. The fact that the problem is on-line means that solutions must be computed in real-time, and the time required cannot exceed some threshold so as not to affect the normal functioning of the system. In addition, a job dispatcher must deal with a lot of uncertainty: submission times, the number of requested resources, and duration of jobs. Heuristic-based techniques have been broadly used in HPC systems, at the cost of achieving (sub-)optimal solutions in a short time. However, the scheduling and resource allocation components are separated, which generates decoupled decisions that may cause a performance loss. Optimization-based techniques are less used for this problem, although they can significantly improve the performance of HPC systems at the expense of higher computation time. Nowadays, HPC systems are being used for modern applications, such as big data analytics and predictive model building, that employ, in general, many short jobs. However, this information is unknown at dispatching time, and job dispatchers need to process large numbers of them quickly while ensuring high Quality-of-Service (QoS) levels. Constraint Programming (CP) has been shown to be an effective approach to tackle job dispatching problems. However, state-of-the-art CP-based job dispatchers are unable to satisfy the challenges of on-line dispatching, such as generating dispatching decisions in a brief period and integrating current and past information of the housing system.
    Given the previous reasons, we propose CP-based dispatchers that are more suitable for HPC systems running modern applications, that generate on-line dispatching decisions in a timely manner, and that are able to make effective use of job duration predictions to improve QoS levels, especially for workloads dominated by short jobs.

    Flight Tests of a Remaining Flying Time Prediction System for Small Electric Aircraft in the Presence of Faults

    This paper addresses the problem of building trust in the online prediction of a battery-powered aircraft's remaining flying time. A series of flight tests is described that makes use of a small electric-powered unmanned aerial vehicle (eUAV) to verify the performance of the remaining flying time prediction algorithm. The estimate of remaining flying time is used to activate an alarm when the predicted remaining time is two minutes. This notifies the pilot to transition to the landing phase of the flight. A second alarm is activated when the battery charge falls below a specified limit threshold. This threshold is the point at which the battery energy reserve would no longer safely support two repeated aborted landing attempts. During the test series, the motor system is operated with the same predefined timed airspeed profile for each test. To test the robustness of the prediction, half of the tests were performed with, and half were performed without, a simulated powertrain fault. The pilot remotely engages a resistor bank at a specified time during the test flight to simulate a partial powertrain fault. The flying time prediction system is agnostic of the pilot's activation of the fault and must adapt to the vehicle's state. The time at which the limit threshold on battery charge is reached is then used to measure the accuracy of the remaining flying time predictions. Accuracy requirements for the alarms are considered and the results discussed.