Intelligent Resource Prediction for HPC and Scientific Workflows
Scientific workflows and high-performance computing (HPC) platforms are critically important to modern scientific research. To perform scientific experiments at scale, domain scientists must have knowledge and expertise in software and hardware systems that are highly complex and rapidly evolving. While computational expertise will remain essential for domain scientists, any tools or practices that reduce this burden will greatly increase the rate of scientific discoveries. One challenge domain scientists face today is knowing the resource usage patterns of an application for the purpose of resource provisioning. A tool that accurately estimates these resource requirements would benefit HPC users in many ways, by reducing job failures and queue times on traditional HPC platforms and reducing costs on cloud computing platforms. To that end, we present Tesseract, a semi-automated tool that predicts resource usage for any application on any computing platform from historical data, with minimal input from the user. We employ Tesseract to predict runtime, memory usage, and disk usage for a diverse set of scientific workflows, and in particular we show how these resource estimates can prevent under-provisioning. Finally, we leverage this core prediction capability to develop solutions for the related challenges of anomaly detection, cross-platform runtime prediction, and cost prediction.
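The core idea, predicting a new job's resource needs from records of past runs and padding the estimate to reduce the risk of under-provisioning, can be shown with a minimal sketch. The features, model choice, and safety factor below are illustrative assumptions, not Tesseract's actual design.

# Illustrative sketch only: predict peak memory for a new job from historical
# runs and pad the estimate to reduce the risk of under-provisioning.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Historical job records: [input_size_gb, n_tasks, cores_requested] (assumed features)
X_hist = np.array([
    [1.2, 10, 4],
    [5.0, 40, 8],
    [0.8, 8, 4],
    [12.0, 96, 16],
])
y_peak_mem_gb = np.array([3.1, 11.5, 2.4, 30.2])  # observed peak memory per run

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_hist, y_peak_mem_gb)

def provision_memory(job_features, safety_factor=1.2):
    """Point prediction plus a multiplicative safety margin (assumed rule)."""
    pred = model.predict(np.asarray(job_features).reshape(1, -1))[0]
    return pred * safety_factor

print(provision_memory([6.0, 48, 8]))  # memory to request for the new job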
Using Workload Prediction and Federation to Increase Cloud Utilization
The widespread adoption of cloud computing has changed how large-scale computing infrastructure is built and managed. Infrastructure-as-a-Service (IaaS) clouds consolidate different separate workloads onto a shared platform and provide a consistent quality of service by overprovisioning capacity. This additional capacity, however, remains idle for extended periods of time and represents a drag on system efficiency. The smaller scale of private IaaS clouds compared to public clouds exacerbates overprovisioning inefficiencies, as opportunities for workload consolidation in private clouds are limited. Federation and cycle harvesting capabilities from computational grids help to improve efficiency, but to date have seen only limited adoption in the cloud due to a fundamental mismatch between the usage models of grids and clouds. Computational grids provide high throughput of queued batch jobs on a best-effort basis and enforce user priorities through dynamic job preemption, while IaaS clouds provide immediate feedback to user requests and make ahead-of-time guarantees about resource availability. We present a novel method to enable workload federation across IaaS clouds that overcomes this mismatch between grid and cloud usage models and improves system efficiency while also offering availability guarantees. We develop a new method for faster-than-realtime simulation of IaaS clouds to make predictions about system utilization and leverage this method to estimate the future availability of preemptible resources in the cloud. We then use these estimates to perform careful admission control and provide ahead-of-time bounds on the preemption probability of federated jobs executing on preemptible resources. Finally, we build an end-to-end prototype that addresses practical issues of workload federation and evaluate the prototype's efficacy using real-world traces from big data and compute-intensive production workloads.
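One way to picture the faster-than-realtime simulation behind the admission control is sketched below: fast-forward the known lease end times of running VMs to estimate idle capacity over a horizon, then admit a preemptible federated job only if that capacity covers its runtime. This is purely illustrative; it ignores future VM arrivals, which the actual predictor must account for, and all names and numbers are assumptions.

# Illustrative sketch: "fast-forward" lease end times of running VMs to
# estimate idle capacity, then decide whether a federated job can be admitted.
def idle_capacity_timeline(total_cores, running_vms):
    """running_vms: list of (cores, seconds_until_lease_end).
    Returns (time, idle_cores) breakpoints, computed instantly rather than in real time."""
    used = sum(cores for cores, _ in running_vms)
    timeline = [(0.0, total_cores - used)]
    for end, cores in sorted((end, cores) for cores, end in running_vms):
        used -= cores
        timeline.append((end, total_cores - used))
    return timeline

def can_admit(job_cores, job_runtime, timeline):
    """True if estimated idle capacity covers the job for its whole runtime."""
    for t, idle in timeline:
        if t >= job_runtime:
            break
        if idle < job_cores:
            return False
    return True

vms = [(8, 120.0), (16, 3600.0), (4, 30.0)]        # (cores, lease time left), assumed data
tl = idle_capacity_timeline(total_cores=64, running_vms=vms)
print(tl)
print(can_admit(job_cores=32, job_runtime=600.0, timeline=tl))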
Novel Representation Learning Technique using Graphs for Performance Analytics
The performance analytics domain in High Performance Computing (HPC) uses tabular data to solve regression problems, such as predicting execution time. Existing Machine Learning (ML) techniques leverage the correlations among features in tabular datasets but do not directly leverage the relationships between samples. Moreover, since high-quality embeddings from raw features improve the fidelity of the downstream predictive models, existing methods rely on extensive feature engineering and pre-processing steps, costing time and manual effort. To fill these two gaps, we propose a novel idea of transforming tabular performance data into graphs to leverage the advancement of Graph Neural Network (GNN) techniques in capturing complex relationships between features and samples. In contrast to other ML application domains, such as social networks, the graph is not given; instead, we need to build it. To address this gap, we propose graph-building methods where nodes represent samples and the edges are inferred automatically and iteratively based on the similarity between the features of the samples. We evaluate the effectiveness of the generated embeddings from GNNs based on how well they make even a simple feed-forward neural network perform on regression tasks compared to other state-of-the-art representation learning techniques. Our evaluation demonstrates that, even with up to 25% of values missing at random in each dataset, our method outperforms commonly used graph- and Deep Neural Network (DNN)-based approaches and achieves up to 61.67% and 78.56% improvement in MSE loss over the DNN baseline for the HPC and machine learning datasets, respectively.
Comment: This paper has been accepted at the 22nd International Conference on Machine Learning and Applications (ICMLA 2023).
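The graph-building step, connecting samples (nodes) by feature similarity so that a GNN can be applied to tabular data, can be sketched minimally as below. A single k-nearest-neighbour pass is shown for illustration; the iterative edge inference described in the abstract is not reproduced, and the data and parameters are assumptions.

# Illustrative sketch: turn tabular performance samples into a graph by linking
# each sample (node) to its most similar samples in feature space.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.random((100, 8))              # 100 samples, 8 performance features (assumed)

k = 5
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)   # +1 because each point matches itself
_, idx = nn.kneighbors(X)

# Directed edge list (source, target) usable by a GNN framework.
edges = [(i, j) for i, row in enumerate(idx) for j in row[1:]]
print(len(edges))                     # 100 * k edges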
Scheduling, Characterization and Prediction of HPC Workloads for Distributed Computing Environments
As High Performance Computing (HPC) has grown considerably and is expected to grow even more, effective resource management for distributed computing systems is motivated more than ever. As computational workloads grow in quantity, it becomes more crucial to apply efficient resource management and workload scheduling to use resources efficiently while keeping computational performance reasonably good. The problem of efficiently scheduling workloads on resources while meeting performance standards is hard. Additionally, non-clairvoyance of job dimensions makes resource management even harder in real-world scenarios. Our research methodology investigates the scheduling problem for HPC and studies the challenges of deploying such scheduling in real-world scenarios using state-of-the-art machine learning and data science techniques.
To this end, this Ph.D. dissertation makes the following core contributions: a) We perform a theoretical analysis of space-sharing, non-preemptive scheduling: we studied this scheduling problem and proposed scheduling algorithms with polynomial computation time. We also proved constant upper bounds on the performance of these algorithms. b) We studied the sensitivity of scheduling algorithms to the accuracy of runtime estimates and devised a meta-learning approach to estimate prediction accuracy for newly submitted jobs to the HPC system. c) We studied the runtime prediction problem for HPC applications. For this purpose, we studied the distribution of available public workloads and proposed two different solutions that can predict multi-modal distributions: switching state-space models and Mixture Density Networks. d) We studied the effectiveness of recent recurrent neural network models for CPU usage trace prediction, for individual VM traces as well as aggregate CPU usage traces.
In this dissertation, we explore solutions to improve the performance of scheduling workloads on distributed systems. We begin by looking at the problem from the theoretical perspective. Modeling the problem mathematically, we first propose a scheduling algorithm that finds a constant approximation of the optimal solution in polynomial time. We prove that the performance of the algorithm (average completion time) is a constant approximation of the performance of the optimal schedule. We next look at the problem in real-world scenarios. Considering High-Performance Computing (HPC) workload computing environments as the closest real-world equivalent of our mathematical model, we explore the problem of predicting application runtime. We propose an algorithm to handle the uncertainties present in the real world and showcase its effectiveness in terms of response time and resource utilization. After looking at the uncertainty problem, we focus on improving the accuracy of existing prediction approaches for HPC application runtime. We propose two solutions, one based on Kalman filters and one based on mixture density networks. We demonstrate the effectiveness of our prediction approaches by comparing them with previous approaches in terms of prediction accuracy and impact on scheduling performance. In the end, we focus on predicting resource usage for individual applications during their execution. We explore the application of recurrent neural networks for predicting the resource usage of applications deployed on individual virtual machines.
To validate our proposed models and solutions, we performed extensive trace-driven simulation and measured the effectiveness of our approaches.
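For the multi-modal runtime prediction contribution, a Mixture Density Network outputs a full mixture distribution over runtimes rather than a single point estimate. The sketch below is a generic MDN in PyTorch, with the architecture, feature count, and mixture size chosen purely for illustration; it is not the dissertation's model.

# Illustrative Mixture Density Network sketch: the network predicts a
# K-component Gaussian mixture over runtimes, capturing multi-modality.
import torch
import torch.nn as nn

class RuntimeMDN(nn.Module):
    def __init__(self, n_features, K=3, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(n_features, hidden), nn.ReLU())
        self.pi = nn.Linear(hidden, K)         # mixture weights (logits)
        self.mu = nn.Linear(hidden, K)         # component means
        self.log_sigma = nn.Linear(hidden, K)  # component std-devs (log scale)

    def forward(self, x):
        h = self.body(x)
        return self.pi(h), self.mu(h), self.log_sigma(h)

def mdn_nll(pi_logits, mu, log_sigma, y):
    """Negative log-likelihood of observed runtimes y under the predicted mixture."""
    log_pi = torch.log_softmax(pi_logits, dim=-1)
    comp = torch.distributions.Normal(mu, log_sigma.exp())
    log_prob = comp.log_prob(y.unsqueeze(-1))            # per-component likelihood
    return -torch.logsumexp(log_pi + log_prob, dim=-1).mean()

model = RuntimeMDN(n_features=6)
x = torch.randn(32, 6)                # job features (assumed)
y = torch.rand(32) * 100              # observed runtimes (assumed)
loss = mdn_nll(*model(x), y)
loss.backward()                       # train with any standard optimizer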
Load Balancing with Job-Size Testing: Performance Improvement or Degradation?
In the context of decision making under explorable uncertainty, scheduling with testing is a powerful technique used in the management of computer systems to improve performance via better job-dispatching decisions. Upon job arrival, a scheduler may run a testing algorithm against the job to extract some information about its structure, e.g., its size, and properly classify it. Acquiring this knowledge comes at a cost, because the testing algorithm delays the dispatching decision, though this delay is under control. In this paper, we analyze the impact of this extra cost in a load balancing setting by investigating the following questions: does it really pay off to test jobs? If so, under which conditions? Under mild assumptions connecting the information extracted by the testing algorithm to its running time, we show that whether scheduling with testing brings a performance degradation or improvement strongly depends on the traffic conditions, the system size, and the coefficient of variation of job sizes. Thus, the general answer to the above questions is non-trivial, and some care should be taken when deploying a testing policy. Our results are obtained by proposing a load balancing model for scheduling with testing that we analyze in two limiting regimes. When the number of servers grows to infinity in proportion to the network demand, we show that job-size testing actually degrades performance unless short jobs can be predicted reliably almost instantaneously and the network load is sufficiently high. When the coefficient of variation of job sizes grows to infinity, we construct testing policies inducing an arbitrarily large performance gain with respect to running jobs untested.
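The trade-off the paper analyzes can be explored numerically with a toy simulation such as the one below: each arriving job is either dispatched immediately to the least-loaded server, or first pays a testing delay to learn whether it is short or long and is then routed to a dedicated pool. The traffic model, pools, and delay value are assumptions for intuition only, not the paper's model or its asymptotic regimes.

# Toy simulation for intuition only: dispatch untested jobs to the least-loaded
# server, or pay a testing delay and route short/long jobs to separate pools.
import random

def job_size(rng):
    # Hyperexponential sizes: many short jobs, a few very long ones (high CV).
    return rng.expovariate(10.0) if rng.random() < 0.9 else rng.expovariate(0.1)

def simulate(n_jobs=50_000, n_servers=10, load=0.8, test_delay=0.0, split=False, seed=1):
    rng = random.Random(seed)
    mean_size = 0.9 / 10.0 + 0.1 / 0.1             # mean job size of the mixture
    arrival_rate = load * n_servers / mean_size
    work = [0.0] * n_servers                       # remaining work per server queue
    short_pool = range(0, n_servers // 2)
    long_pool = range(n_servers // 2, n_servers)
    total_response = 0.0
    for _ in range(n_jobs):
        dt = rng.expovariate(arrival_rate)
        work = [max(0.0, w - dt) for w in work]    # servers drain their queues
        size = job_size(rng)
        if split:                                  # tested: route by learned size class
            pool = short_pool if size < 1.0 else long_pool
        else:                                      # untested: all servers eligible
            pool = range(n_servers)
        s = min(pool, key=lambda i: work[i])       # least-loaded server in pool
        total_response += test_delay + work[s] + size
        work[s] += size
    return total_response / n_jobs

print("untested    :", round(simulate(), 3))
print("with testing:", round(simulate(test_delay=0.05, split=True), 3))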
Big data and hydroinformatics
Big data is popular in the areas of computer science, commerce and bioinformatics, but is at an early stage in hydroinformatics. Big data originated from extremely large datasets that cannot be processed in a tolerable elapsed time with traditional data processing methods. Using an analogy from object-oriented programming, big data should be considered as objects encompassing the data, its characteristics and the processing methods. Hydroinformatics can benefit from big data technology, with newly emerged data, techniques and analytical tools to handle large datasets, from which creative ideas and new values could be mined. This paper provides a timely review of big data and its relevance to hydroinformatics. A further exploration of precipitation big data is discussed, because estimation of precipitation is an important part of hydrology for managing floods and droughts and for understanding the global water cycle. It is promising that fusion of precipitation data from remote sensing, weather radar, rain gauges and numerical weather modelling could be achieved by parallel computing and distributed data storage, which will trigger a leap in precipitation estimation, as the available data from multiple sources could be fused to generate a better product than those from single sources.
Constraint Programming-based Job Dispatching for Modern HPC Applications
A High-Performance Computing job dispatcher is a critical software component that assigns the finite computing resources to submitted jobs. This resource assignment over time is known as the on-line job dispatching problem in HPC systems. The fact that the problem is on-line means that solutions must be computed in real time, and the time required cannot exceed a certain threshold without affecting normal system functioning. In addition, a job dispatcher must deal with considerable uncertainty: submission times, the number of requested resources, and the duration of jobs. Heuristic-based techniques have been used broadly in HPC systems; they compute solutions in a short time, at the cost of (sub-)optimality. Moreover, their scheduling and resource allocation components are separated, which produces decoupled decisions that may cause a performance loss. Optimization-based techniques are less used for this problem, although they can significantly improve the performance of HPC systems at the expense of higher computation time.
Nowadays, HPC systems are being used for modern applications, such as big data analytics and predictive model building, which in general involve many short jobs. However, this information is unknown at dispatching time, and job dispatchers need to process large numbers of such jobs quickly while ensuring high Quality-of-Service (QoS) levels. Constraint Programming (CP) has been shown to be an effective approach for tackling job dispatching problems. However, state-of-the-art CP-based job dispatchers are unable to meet the challenges of on-line dispatching, such as generating dispatching decisions within a short period and integrating current and past information about the hosting system.
For these reasons, we propose CP-based dispatchers that are better suited to HPC systems running modern applications: they generate on-line dispatching decisions within an appropriate time and make effective use of job duration predictions to improve QoS levels, especially for workloads dominated by short jobs.
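To give a flavour of what a CP formulation for dispatching looks like, the sketch below schedules a handful of queued jobs on a node with limited cores, minimizing total completion time under a hard solver time limit so the decision remains compatible with on-line operation. It uses Google OR-Tools CP-SAT purely as an illustration; the solver, job data, and objective are assumptions, not the dispatchers proposed in this work.

# Illustrative CP sketch: interval variables per job, cores as a cumulative
# resource, total completion time as the objective, and an on-line time budget.
from ortools.sat.python import cp_model

jobs = [  # (estimated_duration, cores_requested) -- durations may be predictions
    (5, 4), (2, 8), (9, 2), (1, 16),
]
node_cores = 16
horizon = sum(d for d, _ in jobs)

model = cp_model.CpModel()
starts, ends, intervals, demands = [], [], [], []
for i, (dur, cores) in enumerate(jobs):
    s = model.NewIntVar(0, horizon, f"start_{i}")
    e = model.NewIntVar(0, horizon, f"end_{i}")
    intervals.append(model.NewIntervalVar(s, dur, e, f"job_{i}"))
    starts.append(s)
    ends.append(e)
    demands.append(cores)

model.AddCumulative(intervals, demands, node_cores)   # core capacity constraint
model.Minimize(sum(ends))                             # total completion time

solver = cp_model.CpSolver()
solver.parameters.max_time_in_seconds = 0.5           # on-line time budget
if solver.Solve(model) in (cp_model.OPTIMAL, cp_model.FEASIBLE):
    for i, s in enumerate(starts):
        print(f"job {i} starts at t={solver.Value(s)}")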
Flight Tests of a Remaining Flying Time Prediction System for Small Electric Aircraft in the Presence of Faults
This paper addresses the problem of building trust in the online prediction of a battery-powered aircraft's remaining flying time. A series of flight tests is described that makes use of a small electric-powered unmanned aerial vehicle (eUAV) to verify the performance of the remaining flying time prediction algorithm. The estimate of remaining flying time is used to activate an alarm when the predicted remaining time falls to two minutes. This notifies the pilot to transition to the landing phase of the flight. A second alarm is activated when the battery charge falls below a specified limit threshold. This threshold is the point at which the battery energy reserve would no longer safely support two repeated aborted landing attempts. During the test series, the motor system is operated with the same predefined timed airspeed profile for each test. To test the robustness of the prediction, half of the tests were performed with, and half without, a simulated powertrain fault. The pilot remotely engages a resistor bank at a specified time during the test flight to simulate a partial powertrain fault. The flying time prediction system is agnostic of the pilot's activation of the fault and must adapt to the vehicle's state. The time at which the limit threshold on battery charge is reached is then used to measure the accuracy of the remaining flying time predictions. Accuracy requirements for the alarms are considered and the results are discussed.
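The kind of calculation such a predictor performs can be sketched in a few lines: smooth the measured current draw and project when the remaining charge will reach the reserve threshold that still permits two aborted landing attempts. This back-of-the-envelope version, with all constants and names invented for illustration, stands in for the model-based prognostics algorithm that was actually flight-tested.

# Back-of-the-envelope sketch (not the flight-test algorithm): smooth the
# measured current draw and project time until the reserve threshold is hit.
def remaining_flying_time(charge_ah, reserve_ah, current_a, smoothed_a, alpha=0.2):
    """Returns (minutes_left, updated smoothed current). All values assumed."""
    smoothed_a = alpha * current_a + (1 - alpha) * smoothed_a  # exponential smoothing
    usable_ah = max(0.0, charge_ah - reserve_ah)
    hours_left = usable_ah / smoothed_a if smoothed_a > 0 else float("inf")
    return hours_left * 60.0, smoothed_a

minutes, smoothed = remaining_flying_time(
    charge_ah=3.2, reserve_ah=0.8, current_a=18.0, smoothed_a=20.0)
if minutes <= 2.0:
    print("ALARM: transition to landing phase")
print(f"{minutes:.1f} minutes of flying time remain")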
Scalable Systems for Large Scale Dynamic Connected Data Processing
As the proliferation of sensors rapidly makes the Internet-of-Things (IoT) a reality, the devices and sensors in this ecosystem, such as smartphones, video cameras, home automation systems, and autonomous vehicles, constantly map out the real world, producing unprecedented amounts of dynamic, connected data that captures complex and diverse relations. Unfortunately, existing big data processing and machine learning frameworks are ill-suited for analyzing such dynamic connected data and face several challenges when employed for this purpose.
This dissertation focuses on the design and implementation of scalable systems for dynamic connected data processing. We discuss simple abstractions that make it easy to operate on such data, efficient data structures for state management, and computation models that reduce redundant work. We also describe how bridging theory and practice with algorithms and techniques that leverage approximation and streaming theory can significantly speed up connected data computations. The systems described in this dissertation achieve more than an order of magnitude improvement over the state of the art.
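One concrete instance of incremental state management over dynamic connected data: maintain connected components with a union-find structure as edges stream in, instead of recomputing the graph analysis from scratch after every update. This sketch is purely illustrative and is not one of the systems described in the dissertation.

# Purely illustrative: incremental connected components over a stream of edges.
class UnionFind:
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[ra] = rb

uf = UnionFind()
edge_stream = [("cam1", "car7"), ("car7", "phone3"), ("cam2", "cam5")]  # assumed data
for u, v in edge_stream:
    uf.union(u, v)                                # near-constant work per update
print(uf.find("cam1") == uf.find("phone3"))       # True: same component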