11,839 research outputs found
Emergent Scheduling of Distributed Execution Frameworks
Distributed Execution Frameworks (DEFs) provide a platform for handling the increasing volume of data available to distributed computational processes, and this demand has driven the creation and use of a large number of DEFs for performing distributed computations. Examples include sorting and analyzing large data sets through map and reduce operations, performing a set of operations across points in a data stream to provide near real-time analysis, and training and testing machine learning models for varying methods of learning, such as supervised, unsupervised and reinforcement learning, exploiting the vast amounts of data available. As a result, different DEFs have become optimal for either fine-grained or coarse-grained computations. For example, Apache Spark provides a framework for coarse-grained data-parallel processes that prioritises data locality, which adds latency to scheduling decisions and would hinder the performance of fine-grained computation. Ray and Apache Flink, by contrast, avoid the latency incurred by the scheduling method used by Apache Spark, while potentially incurring longer job completion times because data locality is no longer a priority. Therefore, this PhD will focus on overcoming the issue of trading performance for differing workloads by exploiting the capabilities of emergent software systems, which learn how to assemble and re-assemble themselves in response to their current deployment conditions and input pattern. This allows the creation of a component-based DEF capable of altering both the local behaviour of a DEF (i.e. local schedulers and placement policies within a centralised scheduler), to potentially improve the performance of a single DEF, and the global behaviour of a DEF, for example the adaptation of a centralised scheduler to a two-level scheduler.
Efficient Task Replication for Fast Response Times in Parallel Computation
One typical use case of large-scale distributed computing in data centers is
to decompose a computation job into many independent tasks and run them in
parallel on different machines, sometimes known as the "embarrassingly
parallel" computation. For this type of computation, one challenge is that the
time to execute a task for each machine is inherently variable, and the overall
response time is constrained by the execution time of the slowest machine. To
address this issue, system designers introduce task replication, which sends
the same task to multiple machines, and obtains the result from the machine that
finishes first. While task replication reduces response time, it usually
increases resource usage. In this work, we propose a theoretical framework to
analyze the trade-off between response time and resource usage. We show that,
while in general, there is a tension between response time and resource usage,
there exist scenarios where replicating tasks judiciously reduces completion
time and resource usage simultaneously. Given the execution time distribution
for machines, we investigate the conditions for a scheduling policy to achieve
optimal performance trade-off, and propose efficient algorithms to search for
optimal or near-optimal scheduling policies. Our analysis gives insights on
when and why replication helps, which can be used to guide scheduler design in
large-scale distributed computing systems. Comment: Extended version of the 2-page paper accepted to ACM SIGMETRICS 201
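The trade-off described in this abstract can be illustrated with a small worked example. The sketch below is not the paper's framework: it assumes a single task, a hypothetical discrete execution-time distribution, replicas that all start at the same time, and a cost model in which every replica is cancelled as soon as the first one finishes (so machine time equals the number of replicas times the fastest completion time). The function name `replication_tradeoff` is illustrative only.

```python
from itertools import product


def replication_tradeoff(dist, replicas):
    """Return (expected latency, expected machine time) for one task.

    dist     -- list of (execution_time, probability) pairs; an assumed,
                hypothetical distribution for a single machine
    replicas -- number of identical copies started at time 0

    Assumption: all copies are cancelled when the first one finishes,
    so machine time = replicas * min(execution times).
    """
    expected_latency = 0.0
    # Enumerate every joint outcome of the independent replicas.
    for outcome in product(dist, repeat=replicas):
        prob = 1.0
        for _, p in outcome:
            prob *= p
        expected_latency += prob * min(t for t, _ in outcome)
    return expected_latency, replicas * expected_latency


# Heavy-tailed case: usually 1 time unit, occasionally 100.
heavy_tail = [(1.0, 0.9), (100.0, 0.1)]
latency_1, cost_1 = replication_tradeoff(heavy_tail, 1)
latency_2, cost_2 = replication_tradeoff(heavy_tail, 2)
```

Under these assumptions, two replicas lower both the expected latency and the expected machine time relative to a single copy, because the rare slow run dominates the single-copy cost. With a lighter tail (for example, a slow run of 10 instead of 100), the second replica still lowers latency but raises machine time, which is the general tension the abstract refers to.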
The Repast Simulation/Modelling System for Geospatial Simulation
The use of simulation/modelling systems can simplify the implementation of agent-based models. Repast is one of the few simulation/modelling software systems that supports the integration of geospatial data, especially vector-based geometries. This paper provides an overview of Repast, including the different development languages available for building agent-based models. Before describing Repast’s core functionality and how models can be developed within it, specific emphasis is placed on its ability to represent dynamics and incorporate geographical information. Once these elements of the system have been covered, a diverse list of Agent-Based Modelling (ABM) applications using Repast is presented, with particular emphasis on spatial applications, in particular those that utilize geospatial data.
Conceptual multi-agent system design for distributed scheduling systems
With the progressive increase in the complexity of dynamic environments, systems require
evolutionary configuration and optimization to meet the increased demand. In this sense, any
change in the conditions of systems or products may require distributed scheduling and resource
allocation of more elementary services. Centralized approaches can become bottlenecks and are
complex to adapt, especially in the case of unexpected events. Multi-agent systems (MAS) can
exploit their automatic and autonomous behaviour to improve the distribution of task effort
and support scheduling decision-making. In addition, MAS are able to obtain quick solutions
through cooperation and smart control by agents, empowered by their coordination and
interoperability. By leveraging an architecture that benefits from collaboration with
distributed artificial intelligence, an approach based on a conceptual MAS design is proposed.
It allows distributed and intelligent management to promote technological innovation in basic
concepts of society, for more sustainable everyday applications in domains with emerging
needs, such as manufacturing and healthcare scheduling systems.
This work has been supported by FCT - Fundação para a Ciência e a Tecnologia within the R&D
Units Projects Scope: UIDB/00319/2020 and UIDB/05757/2020. Filipe Alves is supported by FCT
Doctorate Grant Reference SFRH/BD/143745/2019.
Timely Long Tail Identification through Agent Based Monitoring and Analytics
The increasing complexity and scale of distributed systems have resulted in the manifestation of emergent behavior which substantially affects overall system performance. A significant emergent property is that of the "Long Tail", whereby a small proportion of task stragglers significantly impact job completion times. To mitigate such behavior, straggling tasks occurring within the system need to be accurately identified in a timely manner. However, current approaches focus on mitigation rather than identification, and typically identify stragglers too late in the execution lifecycle. This paper presents a method and tool to identify Long Tail behavior within distributed systems in a timely manner, through a combination of online and offline analytics. This is achieved through historical analysis to profile and model task execution patterns, which then inform online analytic agents that monitor task execution at runtime. Furthermore, we provide an empirical analysis of two large-scale production Cloud data centers that demonstrates the challenge of data skew within modern distributed systems. This analysis shows that approximately 5% of task stragglers caused by data skew impact 50% of the total jobs for batch processes. Our results demonstrate that our approach is capable of identifying task stragglers less than 11% into their execution lifecycle with 98% accuracy, a significant improvement over current state-of-the-art practice that enables far more effective mitigation strategies in large-scale distributed systems worldwide.
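The combination of offline profiling and online monitoring mentioned in this abstract might be sketched roughly as follows. This is not the paper's method or tool: the threshold rule (a multiple of the historical median duration), the progress-based projection, and the names `straggler_threshold` and `flag_stragglers` are all assumptions made for illustration.

```python
from statistics import median


def straggler_threshold(historical_durations, factor=1.5):
    """Offline step: derive a duration threshold from past task runs.

    Assumption: a task counts as a straggler when it takes longer than
    `factor` times the median historical duration.
    """
    return factor * median(historical_durations)


def flag_stragglers(running, threshold):
    """Online step: flag running tasks projected to exceed the threshold.

    running -- dict mapping task id to (elapsed_seconds, progress),
               where progress is the completed fraction in (0, 1]
    """
    flagged = []
    for task_id, (elapsed, progress) in running.items():
        if progress <= 0:
            continue  # no progress reported yet, so no projection is possible
        projected_total = elapsed / progress  # linear extrapolation
        if projected_total > threshold:
            flagged.append(task_id)
    return flagged


# Hypothetical data: five past durations and two tasks at 10% progress.
history = [10, 12, 11, 9, 50]
threshold = straggler_threshold(history)
suspects = flag_stragglers({"t1": (1.0, 0.10), "t2": (3.0, 0.10)}, threshold)
```

Using the median rather than the mean keeps the threshold from being inflated by past stragglers in the history, and projecting from early progress is what would let a monitor flag a task well before it finishes.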