Scheduling, Characterization and Prediction of HPC Workloads for Distributed Computing Environments
As High Performance Computing (HPC) has grown considerably and is expected to grow even more, effective resource management for distributed computing systems is needed more than ever. As computational workloads grow in quantity, it is becoming more crucial to apply efficient resource management and workload scheduling to use resources efficiently while keeping computational performance reasonably good. The problem of efficiently scheduling workloads on resources while meeting performance standards is hard. Additionally, non-clairvoyance of job dimensions makes resource management even harder in real-world scenarios. Our research methodology investigates the scheduling problem for HPC and the challenges of deploying scheduling in real-world scenarios using state-of-the-art machine learning and data science techniques. To this end, this Ph.D. dissertation makes the following core contributions: a) We perform a theoretical analysis of space-sharing, non-preemptive scheduling: we studied this scheduling problem and proposed scheduling algorithms with polynomial computation time. We also proved constant upper bounds for the performance of these algorithms. b) We studied the sensitivity of scheduling algorithms to the accuracy of runtime predictions and devised a meta-learning approach to estimate prediction accuracy for jobs newly submitted to the HPC system. c) We studied the runtime prediction problem for HPC applications. For this purpose, we studied the distribution of available public workloads and proposed two different solutions that can predict multi-modal distributions: switching state-space models and Mixture Density Networks. d) We studied the effectiveness of recent recurrent neural network models for CPU usage trace prediction, both for individual VM traces and for aggregate CPU usage traces.
In this dissertation, we explore solutions to improve the performance of scheduling workloads on distributed systems. We begin by looking at the problem from the theoretical perspective. Modeling the problem mathematically, we first propose a scheduling algorithm that finds a constant approximation of the optimal solution for the problem in polynomial time. We prove that the performance of the algorithm (average completion time) is a constant approximation of the performance of the optimal schedule. We next look at the problem in real-world scenarios. Considering High-Performance Computing (HPC) workload computing environments as the closest real-world equivalent of our mathematical model, we explore the problem of predicting application runtime. We propose an algorithm to handle the uncertainties that exist in the real world and showcase its effectiveness in terms of response time and resource utilization. After looking at the uncertainty problem, we focus on improving the accuracy of existing prediction approaches for HPC application runtime. We propose two solutions, one based on Kalman filters and one based on deep mixture density networks. We showcase the effectiveness of our prediction approaches by comparing them with previous prediction approaches in terms of prediction accuracy and impact on scheduling performance. In the end, we focus on predicting resource usage for individual applications during their execution. We explore the application of recurrent neural networks for predicting resource usage of applications deployed on individual virtual machines. To validate our proposed models and solutions, we performed extensive trace-driven simulation and measured the effectiveness of our approaches.
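The abstract above measures schedules by average completion time and notes that scheduling is sensitive to runtime prediction accuracy. The following is a minimal sketch of that relationship, not the dissertation's algorithm: it assumes a single machine, non-preemptive execution, and a simple shortest-predicted-job-first ordering. The function names and the sample runtimes are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def average_completion_time(runtimes: Sequence[float], order: Sequence[int]) -> float:
    """Run jobs back to back (non-preemptively) on one machine in `order`
    and return the mean time at which jobs finish."""
    clock = 0.0
    total = 0.0
    for job in order:
        clock += runtimes[job]  # the job occupies the machine for its full actual runtime
        total += clock          # its completion time is the clock value when it ends
    return total / len(order)


def shortest_predicted_first(predicted: Sequence[float]) -> List[int]:
    """Order job indices by ascending predicted runtime (illustrative policy)."""
    return sorted(range(len(predicted)), key=lambda j: predicted[j])


# Hypothetical actual runtimes of three jobs.
actual = [8.0, 1.0, 3.0]

# Ordering with perfect knowledge of runtimes.
perfect_order = shortest_predicted_first(actual)

# Ordering with hypothetical, inaccurate predictions.
noisy_order = shortest_predicted_first([2.0, 4.0, 3.0])

perfect_avg = average_completion_time(actual, perfect_order)
noisy_avg = average_completion_time(actual, noisy_order)
```

With these sample values, the ordering built from inaccurate predictions runs the longest job first, so its average completion time is higher than that of the ordering built from the true runtimes.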
Attention-based machine perception for intelligent cyber-physical systems
Cyber-physical systems (CPS) fundamentally change how information systems interact with the physical world. They integrate sensing, computing, and communication capabilities on heterogeneous platforms and infrastructures. Efficient and effective perception of the environment lays the foundation for proper operation of other CPS components (e.g., planning and control). Recent advances in artificial intelligence (AI) have changed, to an unprecedented degree, how cyber systems extract knowledge from collected sensing data and understand their physical surroundings. This novel data-to-knowledge transformation capability pushes a wide spectrum of recognition tasks (e.g., visual object detection, speech recognition, and sensor-based human activity recognition) to a higher level, and opens a new era of intelligent cyber-physical systems. However, state-of-the-art neural perception models are typically computation-intensive and sensitive to data noise, which poses significant challenges when they are deployed on resource-limited embedded platforms.
This dissertation works on optimizing both the efficiency and efficacy of deep-neural-network (DNN)-based machine perception in intelligent cyber-physical systems. We extensively exploit and apply the design philosophy of attention, which originated in cognitive psychology, from multiple perspectives of machine perception. It generally means allocating different degrees of concentration to different perceived stimuli. Specifically, we address the following five research questions: First, can we run computation-intensive neural perception models in real time by only looking at (i.e., scheduling) the important parts of the perceived scenes, with cueing from an external sensor? Second, can we eliminate the dependency on external cueing and make the scheduling framework a self-cueing system? Third, how should workloads be distributed among cameras in a distributed (visual) perception system, where multiple cameras can observe the same parts of the environment? Fourth, how can the achieved perception quality be optimized when sensing data from heterogeneous locations and sensor types are collected and utilized? Fifth, how should sensor failures be handled in a distributed sensing system, when the deployed neural perception models are sensitive to missing data?
We formulate the above problems and introduce a corresponding attention-based solution for each, constructing the fundamental building blocks of an attention-based machine perception system for intelligent CPS with both efficiency and efficacy guarantees.
Embedding Games
Large-scale distributed computing infrastructures pose challenging resource management problems, which could be addressed by adopting one of two perspectives. On the one hand, the problem could be framed as a global optimization that aims to minimize some notion of system-wide (social) cost. On the other hand, the problem could be framed in a game-theoretic setting whereby rational, selfish users compete for a share of the resources so as to maximize their private utilities with little or no regard for system-wide objectives. This game-theoretic setting is particularly applicable to emerging cloud and grid environments, testbed platforms, and many networking applications. By adopting the first, global optimization perspective, this thesis presents NetEmbed: a framework, associated mechanisms, and implementations that enable the mapping of requested configurations to available infrastructure resources. By adopting the second, game-theoretic perspective, this thesis defines and establishes the premises of two resource acquisition mechanisms: Colocation Games and Trade and Cap. Colocation Games enable the modeling and analysis of the dynamics that result when rational, selfish parties interact in an attempt to minimize the individual costs they incur to secure shared resources necessary to support their application QoS or SLA requirements. Trade and Cap is a market-based scheduling and load-balancing mechanism that facilitates the trading of resources when users have a mixture of rigid and fluid jobs, and incentivizes users to behave in ways that result in better load-balancing of shared resources. In addition to developing their analytical underpinnings, this thesis establishes the viability of NetEmbed, Colocation Games, and Trade and Cap by presenting implementation blueprints and experimental results for many variants of these mechanisms.
The results presented in this thesis pave the way for the development of economically sound resource acquisition and management solutions in two emerging and increasingly important settings. In pay-as-you-go settings, where pricing is based on usage, this thesis anticipates new service offerings that enable efficient marketplaces in the presence of non-cooperative, selfish agents. In settings where pricing is not a function of usage, this thesis anticipates the development of service offerings that enable trading of usage rights to maximize the utility of a shared infrastructure to its tenants.
LIPIcs, Volume 274, ESA 2023, Complete Volume