    Emergent Scheduling of Distributed Execution Frameworks

    Distributed Execution Frameworks (DEFs) provide a platform for handling the increasing volume of data available to distributed computational processes, which has led to the creation and use of a large number of DEFs for performing distributed computations. Examples include sorting and analyzing large data sets through map and reduce operations; performing a set of operations across points in a data stream to provide near real-time analysis; and training and testing machine learning models for varying methods of learning, such as supervised, unsupervised and reinforcement learning, exploiting the vast amounts of data available. As a result, different DEFs have become optimal for either fine-grained or coarse-grained computations. For example, Apache Spark provides a framework for coarse-grained data-parallel processes that prioritises data locality, which adds latency to scheduling decisions and would hinder the performance of fine-grained computation. Ray and Apache Flink, by contrast, avoid the latency incurred by the scheduling method used by Apache Spark, while potentially incurring longer job completion times because data locality is no longer a priority. Therefore, this PhD will focus on overcoming the issue of trading performance for differing workloads by exploiting the capabilities of emergent software systems, which learn how to assemble and re-assemble themselves in response to their current deployment conditions and input pattern. This allows the creation of a component-based DEF capable of altering both the local behaviour of a DEF (i.e. local schedulers and placement policies within a centralised scheduler), to potentially improve the performance of a single DEF, and the global behaviour of a DEF, for example the adaptation of a centralised scheduler to a two-level scheduler.

    The Design Space of Emergent Scheduling for Distributed Execution Frameworks

    Efficient Task Replication for Fast Response Times in Parallel Computation

    One typical use case of large-scale distributed computing in data centers is to decompose a computation job into many independent tasks and run them in parallel on different machines, sometimes known as "embarrassingly parallel" computation. For this type of computation, one challenge is that the time each machine takes to execute a task is inherently variable, and the overall response time is constrained by the execution time of the slowest machine. To address this issue, system designers introduce task replication, which sends the same task to multiple machines and obtains the result from the machine that finishes first. While task replication reduces response time, it usually increases resource usage. In this work, we propose a theoretical framework to analyze the trade-off between response time and resource usage. We show that, while in general there is a tension between response time and resource usage, there exist scenarios where replicating tasks judiciously reduces completion time and resource usage simultaneously. Given the execution time distribution for machines, we investigate the conditions for a scheduling policy to achieve an optimal performance trade-off, and propose efficient algorithms to search for optimal or near-optimal scheduling policies. Our analysis gives insights into when and why replication helps, which can be used to guide scheduler design in large-scale distributed computing systems. Comment: Extended version of the 2-page paper accepted to ACM SIGMETRICS 201
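
    The trade-off described in this abstract can be illustrated with a small exact calculation. The sketch below is not the paper's framework or algorithm. It assumes a hypothetical two-point execution time distribution (1 time unit with probability 0.9, 10 with probability 0.1), a policy expressed as a list of replica launch times, and cancellation of all remaining replicas, at no cost, the moment the first one finishes. The names `evaluate` and `DIST` are illustrative.

```python
from fractions import Fraction
from itertools import product

# Hypothetical execution-time distribution: (duration, probability) pairs.
DIST = [(1, Fraction(9, 10)), (10, Fraction(1, 10))]

def evaluate(dist, start_times):
    """Return (expected latency, expected machine time) for one task.

    start_times[i] is the time at which replica i is launched. Assumption:
    a replica whose launch time is not before the task's completion is
    never started, and running replicas are cancelled at completion.
    """
    exp_latency = Fraction(0)
    exp_cost = Fraction(0)
    # Enumerate every joint outcome of the replicas' execution times.
    for outcome in product(dist, repeat=len(start_times)):
        prob = Fraction(1)
        for _, p in outcome:
            prob *= p
        # The task completes when the earliest replica finishes.
        finish = min(s + x for s, (x, _) in zip(start_times, outcome))
        # Each replica occupies a machine from its launch until completion.
        cost = sum(max(finish - s, 0) for s in start_times)
        exp_latency += prob * finish
        exp_cost += prob * cost
    return exp_latency, exp_cost

for policy in ([0], [0, 0], [0, 1]):
    latency, cost = evaluate(DIST, policy)
    print(policy, float(latency), float(cost))
```

    Under these assumptions, no replication (`[0]`) gives an expected latency and machine time of 1.9 each. Launching two copies at once (`[0, 0]`) lowers latency to 1.09 but raises machine time to 2.18, while launching the second copy only if the first is still running at time 1 (`[0, 1]`) gives 1.18 and 1.36, lower than the unreplicated baseline on both measures.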

    The Repast Simulation/Modelling System for Geospatial Simulation

    The use of simulation/modelling systems can simplify the implementation of agent-based models. Repast is one of the few simulation/modelling software systems that supports the integration of geospatial data, especially vector-based geometries. This paper provides details about Repast, beginning with an overview that includes the different languages available for developing agent-based models. Before describing Repast’s core functionality and how models can be developed within it, specific emphasis will be placed on its ability to represent dynamics and incorporate geographical information. Once these elements of the system have been covered, a diverse list of Agent-Based Modelling (ABM) applications using Repast will be presented, with particular emphasis on spatial applications that utilize geospatial data.

    Conceptual multi-agent system design for distributed scheduling systems

    With the progressive increase in the complexity of dynamic environments, systems require evolutionary configuration and optimization to meet increased demand. Any change in the conditions of systems or products may therefore require distributed scheduling and resource allocation of more elementary services. Centralized approaches can become bottlenecks and are complex to adapt, especially in the case of unexpected events. Multi-agent systems (MAS) can instead exploit their automatic and autonomous behaviour to improve the distribution of task effort and support scheduling decision-making. In addition, MAS are able to obtain quick solutions through cooperation and smart control by agents, empowered by their coordination and interoperability. By leveraging an architecture that benefits from collaboration with distributed artificial intelligence, an approach is proposed based on a conceptual MAS design that allows distributed and intelligent management, promoting technological innovation in basic concepts of society for more sustainable everyday applications in domains with emerging needs, such as manufacturing and healthcare scheduling systems. This work has been supported by FCT - Fundação para a Ciência e a Tecnologia within the R&D Units Projects Scope: UIDB/00319/2020 and UIDB/05757/2020. Filipe Alves is supported by FCT Doctorate Grant Reference SFRH/BD/143745/2019.

    Timely Long Tail Identification through Agent Based Monitoring and Analytics

    The increasing complexity and scale of distributed systems have resulted in the manifestation of emergent behavior which substantially affects overall system performance. A significant emergent property is that of the "Long Tail", whereby a small proportion of task stragglers significantly impact job completion times. To mitigate such behavior, straggling tasks occurring within the system need to be accurately identified in a timely manner. However, current approaches focus on mitigation rather than identification, and typically identify stragglers too late in the execution lifecycle. This paper presents a method and tool to identify Long Tail behavior within distributed systems in a timely manner, through a combination of online and offline analytics. This is achieved through historical analysis to profile and model task execution patterns, which then inform online analytic agents that monitor task execution at runtime. Furthermore, we provide an empirical analysis of two large-scale production Cloud data centers that demonstrates the challenge of data skew within modern distributed systems; this analysis shows that approximately 5% of task stragglers caused by data skew impact 50% of the total jobs for batch processes. Our results demonstrate that our approach is capable of identifying task stragglers less than 11% into their execution lifecycle with 98% accuracy, a significant improvement over current state-of-the-art practice that enables far more effective mitigation strategies in large-scale distributed systems worldwide.
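
    The two-stage idea in this abstract (an offline profile built from historical executions, then online monitoring of running tasks) can be sketched in a few lines. This is not the paper's method or tool. It assumes a deliberately simple profile (a threshold set at 1.5 times the median historical duration) and a simple projection (elapsed time divided by reported progress). The function names, the 1.5 factor and the sample data are all hypothetical.

```python
from statistics import median

def build_profile(historical_durations, factor=1.5):
    """Offline step (assumed): derive a duration threshold from history.

    A task is treated as a straggler if its projected duration exceeds
    `factor` times the median historical duration.
    """
    return factor * median(historical_durations)

def flag_stragglers(running, threshold):
    """Online step (assumed): flag tasks projected to exceed the threshold.

    `running` maps a task id to (elapsed_seconds, progress), where
    progress is the completed fraction in (0, 1].
    """
    flagged = []
    for task_id, (elapsed, progress) in running.items():
        if progress <= 0:
            continue  # no progress signal yet, so nothing to project
        projected = elapsed / progress  # linear extrapolation to completion
        if projected > threshold:
            flagged.append(task_id)
    return sorted(flagged)

# Hypothetical data: five past task durations and three running tasks.
threshold = build_profile([100, 110, 90, 105, 95])
running = {"t1": (10, 0.10), "t2": (12, 0.05), "t3": (30, 0.25)}
print(flag_stragglers(running, threshold))
```

    With this sample data the threshold is 150 seconds and only `t2`, projected to take about 240 seconds, is flagged, after just 5% of its work has completed. A real profile would likely model execution patterns per task type rather than use a single median.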