109 research outputs found

    Exploring and Evaluating the Scalability and Efficiency of Apache Spark using Educational Datasets

    Research into combining data mining and machine learning technology with web-based education systems (known as educational data mining, or EDM) is becoming imperative in order to enhance the quality of education by moving beyond traditional methods. With the worldwide growth of information and communication technology (ICT), data are becoming available in significantly larger volumes, at high velocity, and with extensive variety. In this thesis, four popular data mining methods are applied on Apache Spark, using large educational datasets from online cognitive learning systems, to explore the scalability and efficiency of Spark. Datasets of various sizes are tested on Spark MLlib under different running configurations and parameter tunings. The thesis presents useful strategies for allocating computing resources and tuning Spark so as to take full advantage of its in-memory design for data mining and machine learning tasks. Moreover, it offers insights that education experts and data scientists can use to manage and improve the quality of education, as well as to analyze and discover hidden knowledge in the era of big data.
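    The abstract's resource-allocation strategies are not spelled out in this listing, so as a hedged illustration, the sketch below encodes a widely used rule of thumb for sizing Spark executors (about five cores per executor, one core and roughly 1 GB per node reserved for the OS and daemons, and a memory-overhead deduction from each executor's heap). The helper name and the cluster numbers are illustrative assumptions, not the thesis's actual tuning.

```python
def size_executors(nodes, cores_per_node, mem_per_node_gb,
                   cores_per_executor=5, overhead_frac=0.10):
    """Heuristic Spark executor sizing: reserve 1 core and ~1 GB per node
    for OS/daemons, pack ~5 cores per executor, subtract one executor
    slot for the driver, and deduct the memory overhead from the heap."""
    usable_cores = cores_per_node - 1
    executors_per_node = usable_cores // cores_per_executor
    total_executors = nodes * executors_per_node - 1  # -1 for the driver
    mem_per_executor = (mem_per_node_gb - 1) / executors_per_node
    heap_gb = int(mem_per_executor * (1 - overhead_frac))
    return {
        "spark.executor.instances": total_executors,
        "spark.executor.cores": cores_per_executor,
        "spark.executor.memory": f"{heap_gb}g",
    }

# Example: a 6-node cluster with 16 cores and 64 GB per node.
conf = size_executors(nodes=6, cores_per_node=16, mem_per_node_gb=64)
```

    The resulting dictionary maps directly onto the standard Spark configuration keys; the heuristic itself is a starting point that the kind of empirical tuning the thesis describes would refine.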

    Geometric Approaches to Big Data Modeling and Performance Prediction

    Big Data frameworks (e.g., Spark) have many configuration parameters, such as memory size, CPU allocation, and the number of nodes (parallelism). Regular users and even expert administrators struggle to understand the relationship between different parameter configurations and the overall performance of the system. In this work, we address this challenge by proposing a performance prediction framework that builds performance models over varied configurable parameters on Spark. We take inspiration from the field of computational geometry to construct a d-dimensional mesh using Delaunay triangulation over a selected set of features. From this mesh, we predict execution time for unknown feature configurations. To minimize the time and resources spent building a model, we propose an adaptive sampling technique that collects as few training points as required. Our evaluation on a cluster of computers using several workloads shows that our prediction error is lower than that of state-of-the-art methods while requiring fewer training samples.
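    A minimal sketch of the geometric idea, reduced for illustration from the d-dimensional Delaunay mesh to a single 2-D triangle: a query configuration inside a triangle of measured configurations is predicted by barycentric (piecewise-linear) interpolation of the three measured runtimes. The configuration axes and runtime numbers here are invented for the example.

```python
def barycentric_predict(tri, query):
    """Predict the value at `query` by barycentric (piecewise-linear)
    interpolation inside one triangle of a Delaunay mesh.
    `tri` holds three ((x, y), value) training points."""
    (x1, y1), v1 = tri[0]
    (x2, y2), v2 = tri[1]
    (x3, y3), v3 = tri[2]
    qx, qy = query
    # Barycentric weights of the query w.r.t. the triangle's vertices.
    det = (y2 - y3) * (x1 - x3) + (x3 - x2) * (y1 - y3)
    w1 = ((y2 - y3) * (qx - x3) + (x3 - x2) * (qy - y3)) / det
    w2 = ((y3 - y1) * (qx - x3) + (x1 - x3) * (qy - y3)) / det
    w3 = 1.0 - w1 - w2
    return w1 * v1 + w2 * v2 + w3 * v3

# Runtimes (s) measured at three (memory_gb, parallelism) configurations.
mesh_cell = [((4, 8), 120.0), ((16, 8), 60.0), ((4, 32), 40.0)]
estimate = barycentric_predict(mesh_cell, (10, 20))
```

    In the full framework, the mesh has one such simplex per neighbourhood of sampled configurations, and adaptive sampling decides where measuring another point would tighten the interpolation most.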

    Dynamic Optimization Techniques for Resource-Efficient Execution of Distributed Machine Learning

    Doctoral dissertation, Department of Computer Science and Engineering, College of Engineering, Seoul National University, February 2020. Advisor: Byung-Gon Chun (์ „๋ณ‘๊ณค). Machine learning (ML) systems are widely used to extract insights from data. Ever-increasing dataset sizes and model complexity have given rise to many efforts towards efficient distributed machine learning systems. One popular approach to supporting large-scale data and complicated models is the parameter server (PS) approach. In this approach, a training job runs with distributed worker and server tasks: workers iteratively compute gradients to update the global model parameters, which are kept in servers. To improve PS system performance, this dissertation proposes two solutions that automatically optimize resource efficiency and system performance. First, we propose a solution that optimizes the resource configuration and workload partitioning of distributed ML training on a PS system. To find the best configuration, we build an Optimizer based on a cost model that works with online metrics. To apply the Optimizer's decisions efficiently, we design our runtime to be elastic, performing reconfiguration in the background with minimal overhead. The second solution optimizes the scheduling of resources and tasks of multiple ML training jobs in a shared cluster. Specifically, we co-locate jobs with complementary resource use to increase resource utilization, while executing their tasks in fine-grained units to avoid resource contention. To alleviate memory pressure from co-located jobs, we enable dynamic spill/reload of data, which adaptively changes the ratio of data held on disk and in memory. We build a working system that implements both approaches: the two solutions share a runtime that can dynamically migrate jobs between machines and reallocate machine resources. We evaluate our system with popular ML applications to verify the effectiveness of our solutions.
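    The parameter server pattern the dissertation builds on can be sketched in a few lines. This is a hedged, single-process toy (the real systems are distributed, with elastic reconfiguration and job co-location); the class and function names are illustrative, and the example fits a one-weight least-squares model across two worker shards.

```python
class ParameterServer:
    """Minimal in-process sketch of the PS architecture: the server holds
    the global model, workers pull it, compute gradients on their data
    shard, and push updates back."""
    def __init__(self, dim, lr=0.1):
        self.weights = [0.0] * dim
        self.lr = lr

    def pull(self):
        return list(self.weights)

    def push(self, gradient):
        # Apply a worker's gradient to the global parameters (SGD step).
        for i, g in enumerate(gradient):
            self.weights[i] -= self.lr * g

def worker_gradient(weights, shard):
    """Mean-squared-error gradient on this worker's shard of (x, y) pairs."""
    grad = [0.0] * len(weights)
    for x, y in shard:
        err = sum(w * xi for w, xi in zip(weights, x)) - y
        for i, xi in enumerate(x):
            grad[i] += 2 * err * xi / len(shard)
    return grad

# Two workers jointly fit y = 3*x from shards of a toy dataset.
server = ParameterServer(dim=1)
shards = [[([1.0], 3.0), ([2.0], 6.0)], [([3.0], 9.0)]]
for step in range(200):
    for shard in shards:
        server.push(worker_gradient(server.pull(), shard))
```

    The pull/compute/push loop is exactly the communication pattern whose resource configuration (number of workers and servers, partitioning of `shards`) the dissertation's Optimizer tunes automatically.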

    An Intermediate Data-driven Methodology for Scientific Workflow Management System to Support Reusability

    Automatic processing of different logical sub-tasks by a set of rules is a workflow. A workflow management system (WfMS) is a system that helps us accomplish a complex scientific task by making a sequential arrangement of sub-tasks available as tools. Workflows are formed with modules from various domains in a WfMS, and many collaborators from those domains are involved in the workflow design process. WfMSs have gained popularity in recent years for managing various tools in a system and ensuring dependencies while building a sequence of executions for scientific analyses. As a result of heterogeneous tool involvement and collaboration requirements, Collaborative Scientific Workflow Management Systems (CSWfMSs) have gained significant interest in the scientific analysis community. In such systems, big data explosion issues arise, with massive velocity and variety characteristics for the large amounts of heterogeneous data from different domains. Therefore, a large amount of heterogeneous data needs to be managed in a Scientific Workflow Management System (SWfMS) with a proper decision mechanism. Although a number of studies have addressed the cost management of data, none of the existing studies considers a real-time decision mechanism or a reusability mechanism. Besides, frequent execution of workflows in a SWfMS generates a massive amount of data, and such data is always incremental in character. Input data and module outcomes of a workflow in a SWfMS are usually large, and processing such data-intensive workflows is time-consuming because modules are computationally expensive for their respective inputs.
    Moreover, the lack of data reusability, limited error recovery, inefficient workflow processing, inefficient storage of derived data, missing metadata association, and missing validation of technique effectiveness in existing systems all need to be addressed in a SWfMS for efficient workflow building under the big data explosion. To address these issues, in this thesis we first propose an intermediate data management scheme for a SWfMS. In our second study, we explore the possibilities and introduce an automatic recommendation technique for a SWfMS based on real-world workflow data (i.e., Galaxy [1] workflows); our investigations show that the proposed technique can facilitate 51% of workflow building in a SWfMS by reusing intermediate data of previous workflows, and can reduce the execution time of workflow building by 74%. Later, we propose an adaptive version of our technique that considers the states of tools in a SWfMS, which shows around 40% reusability for workflows. In our fourth study, we conduct several experiments to analyse the performance and explore the effectiveness of the technique in a SWfMS under various environments. The technique is designed to reduce storage cost, increase data reusability, and speed up workflow execution; to the best of our knowledge, it is the first of its kind. The detailed architecture and evaluation of the technique are presented in this thesis. We believe our findings and the developed system will contribute significantly to the research domain of SWfMSs.
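    The core of intermediate-data reuse can be sketched as a content-addressed cache: a module's output is keyed by the module name plus a hash of its input, so a later workflow repeating the same step fetches the stored result instead of recomputing it. This is a hedged toy (the class name and keying scheme are assumptions); the thesis's actual scheme additionally weighs storage cost and tool state.

```python
import hashlib

class IntermediateStore:
    """Sketch of intermediate-data reuse in a SWfMS: each module output
    is keyed by (module name, hash of its input), so repeated steps in
    later workflows are served from the store."""
    def __init__(self):
        self.cache = {}
        self.hits = 0

    def _key(self, module, data):
        digest = hashlib.sha256(repr(data).encode()).hexdigest()
        return (module, digest)

    def run(self, module, func, data):
        key = self._key(module, data)
        if key in self.cache:
            self.hits += 1          # reuse: skip recomputation
            return self.cache[key]
        result = func(data)         # compute and remember
        self.cache[key] = result
        return result

store = IntermediateStore()
raw = [4, 1, 3, 2]
sorted_once = store.run("sort", sorted, raw)
sorted_again = store.run("sort", sorted, raw)  # served from the store
```

    Chaining `run` calls, with each step's output feeding the next step's input hash, gives exactly the reuse opportunity the reported 51% facilitation figure exploits.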

    High-Performance Modelling and Simulation for Big Data Applications

    This open access book was prepared as a Final Publication of the COST Action IC1406 “High-Performance Modelling and Simulation for Big Data Applications (cHiPSet)” project. Long considered important pillars of the scientific method, Modelling and Simulation have evolved from traditional discrete numerical methods to complex data-intensive continuous analytical optimisations. Resolution, scale, and accuracy have become essential to predict and analyse natural and complex systems in science and engineering. As their level of abstraction rises to afford better discernment of the domain at hand, their representation becomes increasingly demanding of computational and data resources. High Performance Computing, in turn, typically entails the effective use of parallel and distributed processing units coupled with efficient storage, communication, and visualisation systems to underpin complex data-intensive applications in distinct scientific and technical domains. A seamless interaction of High Performance Computing with Modelling and Simulation is therefore arguably required in order to store, compute, analyse, and visualise large data sets in science and engineering. Funded by the European Commission, cHiPSet has provided a dynamic trans-European forum for its members and distinguished guests to openly discuss novel perspectives and topics of interest for these two communities. This cHiPSet compendium presents a set of selected case studies related to healthcare, biological data, computational advertising, multimedia, finance, bioinformatics, and telecommunications.

    Reinventing the Social Scientist and Humanist in the Era of Big Data

    This book explores the big data evolution by interrogating the notion that big data is a disruptive innovation that appears to be challenging existing epistemologies in the humanities and social sciences. Exploring various (controversial) facets of big data, such as ethics, data power, and data justice, the book attempts to clarify the trajectory of the epistemology of (big) data-driven science in the humanities and social sciences.


    Enabling Model-Driven Live Analytics For Cyber-Physical Systems: The Case of Smart Grids

    Advances in software, embedded computing, sensors, and networking technologies will lead to a new generation of smart cyber-physical systems that will far exceed the capabilities of today's embedded systems. They will be entrusted with increasingly complex tasks like controlling electric grids or autonomously driving cars. These systems have the potential to lay the foundations for tomorrow's critical infrastructures, to form the basis of emerging and future smart services, and to improve the quality of our everyday lives in many areas. To accomplish their tasks, they have to continuously monitor and collect data from physical processes, analyse this data, and make decisions based on it. Making smart decisions requires a deep understanding of the environment, the internal state, and the impact of actions. Such deep understanding relies on efficient data models to organise the sensed data and on advanced analytics. Since cyber-physical systems control physical processes, decisions need to be taken very fast, which makes it necessary to analyse data live, as opposed to conventional batch analytics. However, the complex nature of such systems, combined with the massive amount of data they generate, imposes fundamental challenges. While data in the context of cyber-physical systems shares some characteristics with big data, it holds a particular complexity: it describes complicated physical phenomena, which makes it difficult to extract a model able to explain the data and its various multi-layered relationships. Existing solutions fail to provide sustainable mechanisms to analyse such data live. This dissertation presents a novel approach, named model-driven live analytics. The main contribution of this thesis is a multi-dimensional graph data model that brings raw data, domain knowledge, and machine learning together in a single model that can drive live analytic processes.
    This model is continuously updated with the sensed data and can be leveraged by live analytic processes to support the decision-making of cyber-physical systems. The presented approach has been developed in collaboration with an industrial partner and applied, in the form of a prototype, to the domain of smart grids. The addressed challenges are derived from this collaboration as a response to shortcomings in the current state of the art. More specifically, this dissertation provides solutions for the following challenges. First, data handled by cyber-physical systems is usually dynamic (data in motion, as opposed to traditional data at rest) and changes frequently and at different paces. Analysing such data is challenging, since data models usually represent only a snapshot of a system at one specific point in time. A common approach is discretisation, which regularly samples and stores such snapshots at specific timestamps to keep track of the history; continuously changing data is then represented as a finite sequence of snapshots. Such representations are very inefficient to analyse, since one must mine the snapshots, extract a relevant dataset, and finally analyse it. For this problem, this thesis presents a temporal graph data model and storage system that treat time as a first-class property. A time-relative navigation concept enables frequently changing data to be analysed very efficiently. Secondly, making sustainable decisions requires anticipating the impact of actions. In complex cyber-physical systems, situations arise where hundreds or thousands of hypothetical actions must be explored before a solid decision can be made. Every action leads to an independent alternative, from which a further set of actions can be applied, and so forth.
    Finding the sequence of actions that leads to the desired alternative requires efficiently creating, representing, and analysing many different alternatives. Given that every alternative has its own history, this creates a very high combinatorial complexity of alternatives and histories, which is hard to analyse. To tackle this problem, this dissertation introduces a multi-dimensional graph data model (as an extension of the temporal graph data model) that can efficiently represent, store, and analyse many different alternatives live. Thirdly, complex cyber-physical systems are often distributed, yet to fulfil their tasks they typically need to share context information between computational entities. This requires analytic algorithms to reason over distributed data, which is a complex task since it relies on aggregating and processing various distributed and constantly changing data. To address this challenge, this dissertation proposes an approach that transparently distributes the presented multi-dimensional graph data model in a peer-to-peer manner and defines a stream processing concept to efficiently handle frequent changes. Fourthly, to meet future needs, cyber-physical systems need to become increasingly intelligent. To make smart decisions, these systems have to continuously refine behavioural models that are known at design time with what can only be learned from live data. Machine learning algorithms can help by extracting commonalities over massive datasets. Nevertheless, a coarse-grained common behaviour model can be very inaccurate for cyber-physical systems, which are composed of completely different entities with very different behaviour; for these systems, fine-grained learning can be significantly more accurate. However, modelling, structuring, and synchronising many fine-grained learning units is challenging.
    To tackle this, this thesis presents an approach to define reusable, chainable, and independently computable fine-grained learning units, which can be modelled together with, and on the same level as, domain data. This allows machine learning to be woven directly into the presented multi-dimensional graph data model. In summary, this thesis provides an efficient multi-dimensional graph data model to enable live analytics of complex, frequently changing, and distributed data of cyber-physical systems. This model can significantly improve data analytics for such systems and empower cyber-physical systems to make smart decisions live. The presented solutions combine and extend methods from model-driven engineering, models@run.time, data analytics, database systems, and machine learning.
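    The temporal-graph idea of treating time as a first-class property can be sketched as attributes stored as timelines rather than single values, with reads resolving "the value in effect at time t" (the time-relative navigation described above). This is a hedged simplification: the class name and the smart-meter attribute are invented, and the full model adds alternatives, distribution, and learning units on top.

```python
import bisect

class TemporalNode:
    """Sketch of a temporal graph node: each attribute is a sorted
    timeline of (timestamp, value) pairs, and a read at time t returns
    the value in effect at t (the latest entry with timestamp <= t)."""
    def __init__(self):
        self.timelines = {}

    def set(self, attr, t, value):
        timeline = self.timelines.setdefault(attr, [])
        bisect.insort(timeline, (t, value))  # keep the timeline sorted

    def get(self, attr, t):
        timeline = self.timelines.get(attr, [])
        # Index of the last entry whose timestamp is <= t.
        i = bisect.bisect_right(timeline, (t, float("inf"))) - 1
        return timeline[i][1] if i >= 0 else None

# A smart-grid meter whose load changes over time.
meter = TemporalNode()
meter.set("load_kw", 0, 1.2)
meter.set("load_kw", 10, 4.8)
meter.set("load_kw", 20, 2.1)
```

    Navigation is then relative to any point in time without replaying snapshots, which is the efficiency argument made against the discretisation approach.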

    Adaptive Failure-Aware Scheduling for Hadoop

    Given the dynamic nature of cloud environments, failures are the norm rather than the exception in the data centers powering cloud frameworks. Despite the diversity of recovery mechanisms integrated into cloud frameworks, their schedulers still make poor scheduling decisions that lead to task failures due to unforeseen events such as unpredicted service demand or hardware outages. Traditionally, simulation and analytical modeling have been widely used to analyze the impact of scheduling decisions on failure rates. However, they cannot provide accurate results and exhaustive coverage of cloud systems, especially when failures occur. In this thesis, we present new approaches for modeling and verifying an adaptive failure-aware scheduling algorithm for Hadoop, in order to detect these failures early and reschedule tasks according to changes in the cloud. Hadoop is the framework of choice on many off-the-shelf clusters in the cloud for processing data-intensive applications by efficiently running them across multiple distributed machines. The proposed scheduling algorithm relies on predictions made by machine learning algorithms trained on previously executed tasks and on data collected from the Hadoop environment. To further improve Hadoop's scheduling decisions on the fly, we use reinforcement learning techniques to select an appropriate scheduling action for each scheduled task. Furthermore, we propose an adaptive algorithm to dynamically detect node failures in Hadoop. We implement the above approaches in ATLAS, an AdapTive Failure-Aware Scheduling algorithm that can be built on top of existing Hadoop schedulers. To illustrate the usefulness and benefits of ATLAS, we conduct a large empirical study on a Hadoop cluster deployed on Amazon Elastic MapReduce (EMR), comparing the performance of ATLAS to that of three Hadoop scheduling algorithms (FIFO, Fair, and Capacity).
    The results show that ATLAS outperforms these scheduling algorithms in terms of failure rates, execution times, and resource utilization. Finally, we propose a new methodology to formally identify the impact of Hadoop's scheduling decisions on failure rates. We use model checking to verify some of the most important scheduling properties in Hadoop (schedulability, resource-deadlock freeness, and fairness) and provide possible strategies to avoid their violation in ATLAS. The formal verification of the Hadoop scheduler allows more task failures to be identified and hence reduces the number of failures in ATLAS.
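    The failure-aware placement step can be sketched as: score each candidate node with a model trained on past executions, pick the lowest-risk node, and defer the task when every placement looks likely to fail. This is a hedged toy, not ATLAS itself: the class, the load-times-memory risk model, and all names are invented for illustration.

```python
class FailureAwareScheduler:
    """Sketch of failure-aware placement: a learned model predicts the
    failure probability of each (task, node) pair, and the task goes to
    the lowest-risk node, or is deferred if all placements look risky."""
    def __init__(self, predict_failure):
        self.predict_failure = predict_failure  # (task, node) -> prob.

    def place(self, task, nodes, threshold=0.5):
        scored = [(self.predict_failure(task, n), n) for n in nodes]
        prob, node = min(scored, key=lambda s: s[0])
        if prob > threshold:
            return None  # defer: every candidate placement is risky
        return node

def toy_model(task, node):
    # Hypothetical stand-in for the trained predictor: failure risk
    # grows with node load and the task's memory demand.
    return min(1.0, node["load"] * task["mem_gb"] / 16.0)

sched = FailureAwareScheduler(toy_model)
nodes = [{"name": "n1", "load": 0.9}, {"name": "n2", "load": 0.2}]
chosen = sched.place({"mem_gb": 8}, nodes)
deferred = sched.place({"mem_gb": 16}, [{"name": "n1", "load": 1.0}])
```

    In the thesis, the predictor is trained on previously executed tasks and the per-task action choice is further refined with reinforcement learning; this sketch shows only the prediction-then-placement skeleton.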

    Network-Wide Monitoring And Debugging

    Modern networks can encompass over 100,000 servers. Managing such an extensive network with a diverse set of network policies has become more complicated with the introduction of programmable hardware and distributed network functions. Furthermore, service level agreements (SLAs) require operators to maintain high performance and availability with low latencies. It is therefore crucial for operators to resolve any network issues quickly. Problems can occur at any layer of the stack: the network (load imbalance), the data plane (incorrect packet processing), the control plane (bugs in configuration), and the coordination among them. Unfortunately, existing debugging tools are not sufficient to monitor, analyze, or debug modern networks; they either lack visibility into the network, require manual analysis, or cannot check for some properties. These limitations arise from an outdated view of networks, i.e., that we can look at a single component in isolation. In this thesis, we describe a new approach that measures, understands, and debugs the network across devices and time. We also target modern stateful packet-processing devices, programmable data planes and distributed network functions, as these are becoming an increasingly common part of the network. Our key insight is to leverage both in-network packet processing (to collect precise measurements) and out-of-network processing (to coordinate measurements and scale analytics). The systems we design based on this approach can support testing and monitoring at data center scale and can handle stateful data in the network. We automate the collection and analysis of measurement data to save operator time and take a step towards self-driving networks.
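    The in-network/out-of-network split can be sketched in two stages: switches keep cheap sampled per-destination counters (standing in for in-network packet processing), and a collector merges them and computes an analytic such as a load-imbalance ratio (the out-of-network stage). This is a heavily hedged simplification; real programmable data planes use sketches and mirroring rather than Python dictionaries, and all names and numbers here are invented.

```python
from collections import Counter

def in_network_sample(packets, every_n=2):
    """In-network stage: a switch keeps a sampled per-destination count
    (1-in-every_n packets) instead of exporting every packet."""
    counts = Counter()
    for i, pkt in enumerate(packets):
        if i % every_n == 0:
            counts[pkt["dst"]] += 1
    return counts

def load_imbalance(per_switch_counts):
    """Out-of-network stage: the collector merges per-switch counts and
    reports imbalance as max/mean traffic per destination."""
    total = Counter()
    for counts in per_switch_counts:
        total.update(counts)
    mean = sum(total.values()) / len(total)
    return max(total.values()) / mean

# Two switches observe traffic towards destinations "a" and "b".
switch1 = in_network_sample([{"dst": d} for d in "aaaabb"])
switch2 = in_network_sample([{"dst": "a"}, {"dst": "a"}])
imbalance = load_imbalance([switch1, switch2])
```

    A ratio well above 1 flags a network-layer load-imbalance problem of the kind listed above, without any switch having exported its full packet stream.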