191 research outputs found

    Scheduling Recurring Distributed Dataflow Jobs Based on Resource Utilization and Interference

    Get PDF
    Resource management systems like YARN or Mesos enable users to share cluster infrastructures by running analytics jobs in temporarily reserved containers. These containers are typically not isolated to achieve high degrees of overall resource utilizations despite the often fluctuating resource usage of single analytic jobs. However, some combinations of jobs utilize the resources better and interfere less with each others when running on the same nodes than others. This paper presents an approach for improving the resource utilization and job throughput when scheduling recurring data analysis jobs in shared cluster environments. Using a reinforcement learning algorithm, the scheduler continuously learns which jobs are best executed simultaneously on the cluster. Our evaluation of an implementation built on Hadoop YARN shows that this approach can increase resource utilization and decrease job runtimes. While interference between jobs can be avoided, co-locations of jobs with complementary resource usage are not yet always fully recognized. However, with a better measure of co-location goodness, our solution can be used to automatically adapt the scheduling to workloads with recurring batch jobs

    Hugo: A Cluster Scheduler that Efficiently Learns to Select Complementary Data-Parallel Jobs

    Full text link
    Distributed data processing systems like MapReduce, Spark, and Flink are popular tools for analysis of large datasets with cluster resources. Yet, users often overprovision resources for their data processing jobs, while the resource usage of these jobs also typically fluctuates considerably. Therefore, multiple jobs usually get scheduled onto the same shared resources to increase the resource utilization and throughput of clusters. However, job runtimes and the utilization of shared resources can vary significantly depending on the specific combinations of co-located jobs. This paper presents Hugo, a cluster scheduler that continuously learns how efficiently jobs share resources, considering metrics for the resource utilization and interference among co-located jobs. The scheduler combines offline grouping of jobs with online reinforcement learning to provide a scheduling mechanism that efficiently generalizes from specific monitored job combinations yet also adapts to changes in workloads. Our evaluation of a prototype shows that the approach can reduce the runtimes of exemplary Spark jobs on a YARN cluster by up to 12.5%, while resource utilization is increased and waiting times can be bounded

    Adaptive Resource Management for Distributed Data Analytics

    Get PDF
    Increasingly large datasets make scalable and distributed data analytics necessary. Frameworks such as Spark and Flink help users in efficiently utilizing cluster resources for their data analytics jobs. It is, however, usually difficult to anticipate the runtime behavior and resource demands of these distributed data analytics jobs. Yet, many resource management decisions would benefit from such information. Addressing this general problem, this chapter presents our vision of adaptive resource management and reviews recent work in this area. The key idea is that workloads should be monitored for trends, patterns, and recurring jobs. These monitoring statistics should be analyzed and used for a cluster resource management calibrated to the actual workload. In this chapter, we motivate and present the idea of adaptive resource management. We also introduce a general system architecture and we review specific adaptive techniques for data placement, resource allocation, and job scheduling in the context of our architecture

    Data-Driven Intelligent Scheduling For Long Running Workloads In Large-Scale Datacenters

    Get PDF
    Cloud computing is becoming a fundamental facility of society today. Large-scale public or private cloud datacenters spreading millions of servers, as a warehouse-scale computer, are supporting most business of Fortune-500 companies and serving billions of users around the world. Unfortunately, modern industry-wide average datacenter utilization is as low as 6% to 12%. Low utilization not only negatively impacts operational and capital components of cost efficiency, but also becomes the scaling bottleneck due to the limits of electricity delivered by nearby utility. It is critical and challenge to improve multi-resource efficiency for global datacenters. Additionally, with the great commercial success of diverse big data analytics services, enterprise datacenters are evolving to host heterogeneous computation workloads including online web services, batch processing, machine learning, streaming computing, interactive query and graph computation on shared clusters. Most of them are long-running workloads that leverage long-lived containers to execute tasks. We concluded datacenter resource scheduling works over last 15 years. Most previous works are designed to maximize the cluster efficiency for short-lived tasks in batch processing system like Hadoop. They are not suitable for modern long-running workloads of Microservices, Spark, Flink, Pregel, Storm or Tensorflow like systems. It is urgent to develop new effective scheduling and resource allocation approaches to improve efficiency in large-scale enterprise datacenters. In the dissertation, we are the first of works to define and identify the problems, challenges and scenarios of scheduling and resource management for diverse long-running workloads in modern datacenter. They rely on predictive scheduling techniques to perform reservation, auto-scaling, migration or rescheduling. It forces us to pursue and explore more intelligent scheduling techniques by adequate predictive knowledges. We innovatively specify what is intelligent scheduling, what abilities are necessary towards intelligent scheduling, how to leverage intelligent scheduling to transfer NP-hard online scheduling problems to resolvable offline scheduling issues. We designed and implemented an intelligent cloud datacenter scheduler, which automatically performs resource-to-performance modeling, predictive optimal reservation estimation, QoS (interference)-aware predictive scheduling to maximize resource efficiency of multi-dimensions (CPU, Memory, Network, Disk I/O), and strictly guarantee service level agreements (SLA) for long-running workloads. Finally, we introduced a large-scale co-location techniques of executing long-running and other workloads on the shared global datacenter infrastructure of Alibaba Group. It effectively improves cluster utilization from 10% to averagely 50%. It is far more complicated beyond scheduling that involves technique evolutions of IDC, network, physical datacenter topology, storage, server hardwares, operating systems and containerization. We demonstrate its effectiveness by analysis of newest Alibaba public cluster trace in 2017. We are the first of works to reveal the global view of scenarios, challenges and status in Alibaba large-scale global datacenters by data demonstration, including big promotion events like Double 11 . Data-driven intelligent scheduling methodologies and effective infrastructure co-location techniques are critical and necessary to pursue maximized multi-resource efficiency in modern large-scale datacenter, especially for long-running workloads

    Resource optimization of edge servers dealing with priority-based workloads by utilizing service level objective-aware virtual rebalancing

    Get PDF
    IoT enables profitable communication between sensor/actuator devices and the cloud. Slow network causing Edge data to lack Cloud analytics hinders real-time analytics adoption. VRebalance solves priority-based workload performance for stream processing at the Edge. BO is used in VRebalance to prioritize workloads and find optimal resource configurations for efficient resource management. Apache Storm platform was used with RIoTBench IoT benchmark tool for real-time stream processing. Tools were used to evaluate VRebalance. Study shows VRebalance is more effective than traditional methods, meeting SLO targets despite system changes. VRebalance decreased SLO violation rates by almost 30% for static priority-based workloads and 52.2% for dynamic priority-based workloads compared to hill climbing algorithm. Using VRebalance decreased SLO violations by 66.1% compared to Apache Storm\u27s default allocation

    MCFlow: Middleware for Mixed-Criticality Distributed Real-Time Systems

    Get PDF
    Traditional fixed-priority scheduling analysis for periodic/sporadic task sets is based on the assumption that all tasks are equally critical to the correct operation of the system. Therefore, every task has to be schedulable under the scheduling policy, and estimates of tasks\u27 worst case execution times must be conservative in case a task runs longer than is usual. To address the significant under-utilization of a system\u27s resources under normal operating conditions that can arise from these assumptions, several \emph{mixed-criticality scheduling} approaches have been proposed. However, to date there has been no quantitative comparison of system schedulability or run-time overhead for the different approaches. In this dissertation, we present what is to our knowledge the first side-by-side implementation and evaluation of those approaches, for periodic and sporadic mixed-criticality tasks on uniprocessor or distributed systems, under a mixed-criticality scheduling model that is common to all these approaches. To make a fair evaluation of mixed-criticality scheduling, we also address some previously open issues and propose modifications to improve schedulability and correctness of particular approaches. To facilitate the development and evaluation of mixed-criticality applications, we have designed and developed a distributed real-time middleware, called MCFlow, for mixed-criticality end-to-end tasks running on multi-core platforms. The research presented in this dissertation provides the following contributions to the state of the art in real-time middleware: (1) an efficient component model through which dependent subtask graphs can be configured flexibly for execution within a single core, across cores of a common host, or spanning multiple hosts; (2) support for optimizations to inter-component communication to reduce data copying without sacrificing the ability to execute subtasks in parallel; (3) a strict separation of timing and functional concerns so that they can be configured independently; (4) an event dispatching architecture that uses lock free algorithms where possible to reduce memory contention, CPU context switching, and priority inversion; and (5) empirical evaluations of MCFlow itself and of different mixed criticality scheduling approaches both with a single host and end-to-end across multiple hosts. The results of our evaluation show that in terms of basic distributed real-time behavior MCFlow performs comparably to the state of the art TAO real-time object request broker when only one core is used and outperforms TAO when multiple cores are involved. We also identify and categorize different use cases under which different mixed criticality scheduling approaches are preferable

    Artificial intelligence driven anomaly detection for big data systems

    Get PDF
    The main goal of this thesis is to contribute to the research on automated performance anomaly detection and interference prediction by implementing Artificial Intelligence (AI) solutions for complex distributed systems, especially for Big Data platforms within cloud computing environments. The late detection and manual resolutions of performance anomalies and system interference in Big Data systems may lead to performance violations and financial penalties. Motivated by this issue, we propose AI-based methodologies for anomaly detection and interference prediction tailored to Big Data and containerized batch platforms to better analyze system performance and effectively utilize computing resources within cloud environments. Therefore, new precise and efficient performance management methods are the key to handling performance anomalies and interference impacts to improve the efficiency of data center resources. The first part of this thesis contributes to performance anomaly detection for in-memory Big Data platforms. We examine the performance of Big Data platforms and justify our choice of selecting the in-memory Apache Spark platform. An artificial neural network-driven methodology is proposed to detect and classify performance anomalies for batch workloads based on the RDD characteristics and operating system monitoring metrics. Our method is evaluated against other popular machine learning algorithms (ML), as well as against four different monitoring datasets. The results prove that our proposed method outperforms other ML methods, typically achieving 98–99% F-scores. Moreover, we prove that a random start instant, a random duration, and overlapped anomalies do not significantly impact the performance of our proposed methodology. The second contribution addresses the challenge of anomaly identification within an in-memory streaming Big Data platform by investigating agile hybrid learning techniques. We develop TRACK (neural neTwoRk Anomaly deteCtion in sparK) and TRACK-Plus, two methods to efficiently train a class of machine learning models for performance anomaly detection using a fixed number of experiments. Our model revolves around using artificial neural networks with Bayesian Optimization (BO) to find the optimal training dataset size and configuration parameters to efficiently train the anomaly detection model to achieve high accuracy. The objective is to accelerate the search process for finding the size of the training dataset, optimizing neural network configurations, and improving the performance of anomaly classification. A validation based on several datasets from a real Apache Spark Streaming system is performed, demonstrating that the proposed methodology can efficiently identify performance anomalies, near-optimal configuration parameters, and a near-optimal training dataset size while reducing the number of experiments up to 75% compared with naïve anomaly detection training. The last contribution overcomes the challenges of predicting completion time of containerized batch jobs and proactively avoiding performance interference by introducing an automated prediction solution to estimate interference among colocated batch jobs within the same computing environment. An AI-driven model is implemented to predict the interference among batch jobs before it occurs within system. Our interference detection model can alleviate and estimate the task slowdown affected by the interference. This model assists the system operators in making an accurate decision to optimize job placement. Our model is agnostic to the business logic internal to each job. Instead, it is learned from system performance data by applying artificial neural networks to establish the completion time prediction of batch jobs within the cloud environments. We compare our model with three other baseline models (queueing-theoretic model, operational analysis, and an empirical method) on historical measurements of job completion time and CPU run-queue size (i.e., the number of active threads in the system). The proposed model captures multithreading, operating system scheduling, sleeping time, and job priorities. A validation based on 4500 experiments based on the DaCapo benchmarking suite was carried out, confirming the predictive efficiency and capabilities of the proposed model by achieving up to 10% MAPE compared with the other models.Open Acces

    Integrating dataflow and non-dataflow real-time application models on multi-core platforms

    Get PDF
    Ferry Jules. Facultés des sciences. Rapports sur les travaux personnels des professeurs. In: Bulletin administratif de l'instruction publique. Tome 24 n°465, 1881. pp. 529-531

    Towards Efficient Resource Allocation for Embedded Systems

    Get PDF
    Das Hauptthema ist die dynamische Ressourcenverwaltung in eingebetteten Systemen, insbesondere die Verwaltung von Rechenzeit und Netzwerkverkehr auf einem MPSoC. Die Idee besteht darin, eine Pipeline für die Verarbeitung von Mobiler Kommunikation auf dem Chip dynamisch zu schedulen, um die Effizienz der Hardwareressourcen zu verbessern, ohne den Ressourcenverbrauch des dynamischen Schedulings dramatisch zu erhöhen. Sowohl Software- als auch Hardwaremodule werden auf Hotspots im Ressourcenverbrauch untersucht und optimiert, um diese zu entfernen. Da Applikationen im Bereich der Signalverarbeitung normalerweise mit Hilfe von SDF-Diagrammen beschrieben werden können, wird deren dynamisches Scheduling optimiert, um den Ressourcenverbrauch gegenüber dem üblicherweise verwendeten statischen Scheduling zu verbessern. Es wird ein hybrider dynamischer Scheduler vorgestellt, der die Vorteile von Processing-Networks und der Planung von Task-Graphen kombiniert. Es ermöglicht dem Scheduler, ein Gleichgewicht zwischen der Parallelisierung der Berechnung und der Zunahme des dynamischen Scheduling-Aufands optimal abzuwägen. Der resultierende dynamisch erstellte Schedule reduziert den Ressourcenverbrauch um etwa 50%, wobei die Laufzeit im Vergleich zu einem statischen Schedule nur um 20% erhöht wird. Zusätzlich wird ein verteilter dynamischer SDF-Scheduler vorgeschlagen, der das Scheduling in verschiedene Teile zerlegt, die dann zu einer Pipeline verbunden werden, um mehrere parallele Prozessoren einzubeziehen. Jeder Scheduling-Teil wird zu einem Cluster mit Load-Balancing erweitert, um die Anzahl der parallel laufenden Scheduling-Jobs weiter zu erhöhen. Auf diese Weise wird dem vorhandene Engpass bei dem dynamischen Scheduling eines zentralisierten Schedulers entgegengewirkt, sodass 7x mehr Prozessoren mit dem Pipelined-Clustered-Dynamic-Scheduler für eine typische Signalverarbeitungsanwendung verwendet werden können. Das neue dynamische Scheduling-System setzt das Vorhandensein von drei verschiedenen Kommunikationsmodi zwischen den Verarbeitungskernen voraus. Bei der Emulation auf Basis des häufig verwendeten RDMA-Protokolls treten Leistungsprobleme auf. Sehr gut kann RDMA für einmalige Punkt-zu-Punkt-Datenübertragungen verwendet werden, wie sie bei der Ausführung von Task-Graphen verwendet werden. Process-Networks verwenden normalerweise Datenströme mit hohem Volumen und hoher Bandbreite. Es wird eine FIFO-basierte Kommunikationslösung vorgestellt, die einen zyklischen Puffer sowohl im Sender als auch im Empfänger implementiert, um diesen Bedarf zu decken. Die Pufferbehandlung und die Datenübertragung zwischen ihnen erfolgen ausschließlich in Hardware, um den Software-Overhead aus der Anwendung zu entfernen. Die Implementierung verbessert die Zugriffsverwaltung mehrerer Nutzer auf flächen-effiziente Single-Port Speichermodule. Es werden 0,8 der theoretisch möglichen Bandbreite, die normalerweise nur mit flächenmäßig teureren Dual-Port-Speichern erreicht wird. Der dritte Kommunikationsmodus definiert eine einfache Message-Passing-Implementierung, die ohne einen Verbindungszustand auskommt. Dieser Modus wird für eine effiziente prozessübergreifende Kommunikation des verteilten Scheduling-Systems und der engen Ansteuerung der restlichen Prozessoren benötigt. Eine Flusskontrolle in Hardware stellt sicher, dass eine große Anzahl von Sendern Nachrichten an denselben Empfänger senden kann. Dabei wird garantiert, dass alle Nachrichten korrekt empfangen werden, ohne dass eine Verbindung hergestellt werden muss und die Nachrichtenlaufzeit gering bleibt. Die Arbeit konzentriert sich auf die Optimierung des Codesigns von Hardware und Software, um die kompromisslose Ressourceneffizienz der dynamischen SDF-Graphen-Planung zu erhöhen. Besonderes Augenmerk wird auf die Abhängigkeiten zwischen den Ebenen eines verteilten Scheduling-Systems gelegt, das auf der Verfügbarkeit spezifischer hardwarebeschleunigter Kommunikationsmethoden beruht.:1 Introduction 1.1 Motivation 1.2 The Multiprocessor System on Chip Architecture 1.3 Concrete MPSoC Architecture 1.4 Representing LTE/5G baseband processing as Static Data Flow 1.5 Compuation Stack 1.6 Performance Hotspots Addressed 1.7 State of the Art 1.8 Overview of the Work 2 Hybrid SDF Execution 2.1 Addressed Performance Hotspot 2.2 State of the Art 2.3 Static Data Flow Graphs 2.4 Runtime Environment 2.5 Overhead of Deloying Tasks to a MPSoC 2.6 Interpretation of SDF Graphs as Task Graphs 2.7 Interpreting SDF Graphs as Process Networks 2.8 Hybrid Interpretation 2.9 Graph Topology Considerations 2.10 Theoretic Impact of Hybrid Interpretation 2.11 Simulating Hybrid Execution 2.12 Pipeline SDF Graph Example 2.13 Random SDF Graphs 2.14 LTE-like SDF Graph 2.15 Key Lernings 3 Distribution of Management 3.1 Addressed Performance Hotspot 3.2 State of the Art 3.3 Revising Deployment Overhead 3.4 Distribution of Overhead 3.5 Impact of Management Distribution to Resource Utilization 3.6 Reconfigurability 3.7 Key Lernings 4 Sliced FIFO Hardware 4.1 Addressed Performance Hotspot 4.2 State of the Art 4.3 System Environment 4.4 Sliced Windowed FIFO buffer 4.5 Single FIFO Evaluation 4.6 Multiple FIFO Evalutaion 4.7 Hardware Implementation 4.8 Key Lernings 5 Message Passing Hardware 5.1 Addressed Performance Hotspot 5.2 State of the Art 5.3 Message Passing Regarded as Queueing 5.4 A Remote Direct Memory Access Based Implementation 5.5 Hardware Implementation Concept 5.6 Evalutation of Performance 5.7 Key Lernings 6 SummaryThe main topic is the dynamic resource allocation in embedded systems, especially the allocation of computing time and network traffic on an multi processor system on chip (MPSoC). The idea is to dynamically schedule a mobile communication signal processing pipeline on the chip to improve hardware resource efficiency while not dramatically improve resource consumption because of dynamic scheduling overhead. Both software and hardware modules are examined for resource consumption hotspots and optimized to remove them. Since signal processing can usually be described with the help of static data flow (SDF) graphs, the dynamic handling of those is optimized to improve resource consumption over the commonly used static scheduling approach. A hybrid dynamic scheduler is presented that combines benefits from both processing networks and task graph scheduling. It allows the scheduler to optimally balance parallelization of computation and addition of dynamic scheduling overhead. The resulting dynamically created schedule reduces resource consumption by about 50%, with a runtime increase of only 20% compared to a static schedule. Additionally, a distributed dynamic SDF scheduler is proposed that splits the scheduling into different parts, which are then connected to a scheduling pipeli ne to incorporate multiple parallel working processors. Each scheduling stage is reworked into a load-balanced cluster to increase the number of parallel scheduling jobs further. This way, the still existing dynamic scheduling bottleneck of a centralized scheduler is widened, allowing handling 7x more processors with the pipelined, clustered dynamic scheduler for a typical signal processing application. The presented dynamic scheduling system assumes the presence of three different communication modes between the processing cores. When emulated on top of the commonly used remote direct memory access (RDMA) protocol, performance issues are encountered. Firstly, RDMA can neatly be used for single-shot point-to-point data transfers, like used in task graph scheduling. Process networks usually make use of high-volume and high-bandwidth data streams. A first in first out (FIFO) communication solution is presented that implements a cyclic buffer on both sender and receiver to serve this need. The buffer handling and data transfer between them are done purely in hardware to remove software overhead from the application. The implementation improves the multi-user access to area-efficient single port on-chip memory modules. It achieves 0.8 of the theoretically possible bandwidth, usually only achieved with area expensive dual-port memories. The third communication mode defines a lightweight message passing (MP) implementation that is truly connectionless. It is needed for efficient inter-process communication of the distributed and clustered scheduling system and the worker processing units’ tight coupling. A hardware flow control assures that an arbitrary number of senders can spontaneously start sending messages to the same receiver. Yet, all messages are guaranteed to be correctly received while eliminating the need for connection establishment and keeping a low message delay. The work focuses on the hardware-software codesign optimization to increase the uncompromised resource efficiency of dynamic SDF graph scheduling. Special attention is paid to the inter-level dependencies in developing a distributed scheduling system, which relies on the availability of specific hardwareaccelerated communication methods.:1 Introduction 1.1 Motivation 1.2 The Multiprocessor System on Chip Architecture 1.3 Concrete MPSoC Architecture 1.4 Representing LTE/5G baseband processing as Static Data Flow 1.5 Compuation Stack 1.6 Performance Hotspots Addressed 1.7 State of the Art 1.8 Overview of the Work 2 Hybrid SDF Execution 2.1 Addressed Performance Hotspot 2.2 State of the Art 2.3 Static Data Flow Graphs 2.4 Runtime Environment 2.5 Overhead of Deloying Tasks to a MPSoC 2.6 Interpretation of SDF Graphs as Task Graphs 2.7 Interpreting SDF Graphs as Process Networks 2.8 Hybrid Interpretation 2.9 Graph Topology Considerations 2.10 Theoretic Impact of Hybrid Interpretation 2.11 Simulating Hybrid Execution 2.12 Pipeline SDF Graph Example 2.13 Random SDF Graphs 2.14 LTE-like SDF Graph 2.15 Key Lernings 3 Distribution of Management 3.1 Addressed Performance Hotspot 3.2 State of the Art 3.3 Revising Deployment Overhead 3.4 Distribution of Overhead 3.5 Impact of Management Distribution to Resource Utilization 3.6 Reconfigurability 3.7 Key Lernings 4 Sliced FIFO Hardware 4.1 Addressed Performance Hotspot 4.2 State of the Art 4.3 System Environment 4.4 Sliced Windowed FIFO buffer 4.5 Single FIFO Evaluation 4.6 Multiple FIFO Evalutaion 4.7 Hardware Implementation 4.8 Key Lernings 5 Message Passing Hardware 5.1 Addressed Performance Hotspot 5.2 State of the Art 5.3 Message Passing Regarded as Queueing 5.4 A Remote Direct Memory Access Based Implementation 5.5 Hardware Implementation Concept 5.6 Evalutation of Performance 5.7 Key Lernings 6 Summar
    corecore