Search CORE

1,409 research outputs found

A Game-Theoretic Approach for Runtime Capacity Allocation in MapReduce

Author: Ardagna Danilo
Ciavotta Michele
Gianniti Eugenio
Passacantando Mauro
Publication venue
Publication date: 01/01/2017
Field of study

Nowadays many companies have available large amounts of raw, unstructured data. Among Big Data enabling technologies, a central place is held by the MapReduce framework and, in particular, by its open source implementation, Apache Hadoop. For cost effectiveness considerations, a common approach entails sharing server clusters among multiple users. The underlying infrastructure should provide every user with a fair share of computational resources, ensuring that Service Level Agreements (SLAs) are met and avoiding wastes. In this paper we consider two mathematical programming problems that model the optimal allocation of computational resources in a Hadoop 2.x cluster with the aim to develop new capacity allocation techniques that guarantee better performance in shared data centers. Our goal is to get a substantial reduction of power consumption while respecting the deadlines stated in the SLAs and avoiding penalties associated with job rejections. The core of this approach is a distributed algorithm for runtime capacity allocation, based on Game Theory models and techniques, that mimics the MapReduce dynamics by means of interacting players, namely the central Resource Manager and Class Managers

arXiv.org e-Print Archive

Archivio istituzionale della ricerca - Politecnico di Milano

Archivio della Ricerca - Università di Pisa

Recommended from our members

MapReduce based RDF assisted distributed SVM for high throughput spam filtering

Author: Caruana Godwin
Publication venue: Brunel University School of Engineering and Design PhD Theses
Publication date: 01/01/2013
Field of study

This thesis was submitted for the degree of Doctor of Philosophy and was awarded by Brunel UniversityElectronic mail has become cast and embedded in our everyday lives. Billions of legitimate emails are sent on a daily basis. The widely established underlying infrastructure, its widespread availability as well as its ease of use have all acted as catalysts to such pervasive proliferation. Unfortunately, the same can be alleged about unsolicited bulk email, or rather spam. Various methods, as well as enabling architectures are available to try to mitigate spam permeation. In this respect, this dissertation compliments existing survey work in this area by contributing an extensive literature review of traditional and emerging spam filtering approaches. Techniques, approaches and architectures employed for spam filtering are appraised, critically assessing respective strengths and weaknesses. Velocity, volume and variety are key characteristics of the spam challenge. MapReduce (M/R) has become increasingly popular as an Internet scale, data intensive processing platform. In the context of machine learning based spam filter training, support vector machine (SVM) based techniques have been proven effective. SVM training is however a computationally intensive process. In this dissertation, a M/R based distributed SVM algorithm for scalable spam filter training, designated MRSMO, is presented. By distributing and processing subsets of the training data across multiple participating computing nodes, the distributed SVM reduces spam filter training time significantly. To mitigate the accuracy degradation introduced by the adopted approach, a Resource Description Framework (RDF) based feedback loop is evaluated. Experimental results demonstrate that this improves the accuracy levels of the distributed SVM beyond the original sequential counterpart. Effectively exploiting large scale, ‘Cloud’ based, heterogeneous processing capabilities for M/R in what can be considered a non-deterministic environment requires the consideration of a number of perspectives. In this work, gSched, a Hadoop M/R based, heterogeneous aware task to node matching and allocation scheme is designed. Using MRSMO as a baseline, experimental evaluation indicates that gSched improves on the performance of the out-of-the box Hadoop counterpart in a typical Cloud based infrastructure. The focal contribution to knowledge is a scalable, heterogeneous infrastructure and machine learning based spam filtering scheme, able to capitalize on collaborative accuracy improvements through RDF based, end user feedback. MapReduce based RDF Assisted Distributed SVM for High Throughput Spam Filterin

Brunel University Research Archive

Performance modelling and optimization for video-analytic algorithms in a cloud-like environment using machine learning

Author: Manal N.K. Al-Rawahi (7168808)
Publication venue
Publication date: 01/01/2016
Field of study

CCTV cameras produce a large amount of video surveillance data per day, and analysing them require the use of significant computing resources that often need to be scalable. The emergence of the Hadoop distributed processing framework has had a significant impact on various data intensive applications as the distributed computed based processing enables an increase of the processing capability of applications it serves. Hadoop is an open source implementation of the MapReduce programming model. It automates the operation of creating tasks for each function, distribute data, parallelize executions and handles machine failures that reliefs users from the complexity of having to manage the underlying processing and only focus on building their application. It is noted that in a practical deployment the challenge of Hadoop based architecture is that it requires several scalable machines for effective processing, which in turn adds hardware investment cost to the infrastructure. Although using a cloud infrastructure offers scalable and elastic utilization of resources where users can scale up or scale down the number of Virtual Machines (VM) upon requirements, a user such as a CCTV system operator intending to use a public cloud would aspire to know what cloud resources (i.e. number of VMs) need to be deployed so that the processing can be done in the fastest (or within a known time constraint) and the most cost effective manner. Often such resources will also have to satisfy practical, procedural and legal requirements. The capability to model a distributed processing architecture where the resource requirements can be effectively and optimally predicted will thus be a useful tool, if available. In literature there is no clear and comprehensive modelling framework that provides proactive resource allocation mechanisms to satisfy a user's target requirements, especially for a processing intensive application such as video analytic. In this thesis, with the hope of closing the above research gap, novel research is first initiated by understanding the current legal practices and requirements of implementing video surveillance system within a distributed processing and data storage environment, since the legal validity of data gathered or processed within such a system is vital for a distributed system's applicability in such domains. Subsequently the thesis presents a comprehensive framework for the performance ii modelling and optimization of resource allocation in deploying a scalable distributed video analytic application in a Hadoop based framework, running on virtualized cluster of machines. The proposed modelling framework investigates the use of several machine learning algorithms such as, decision trees (M5P, RepTree), Linear Regression, Multi Layer Perceptron(MLP) and the Ensemble Classifier Bagging model, to model and predict the execution time of video analytic jobs, based on infrastructure level as well as job level parameters. Further in order to propose a novel framework for the allocate resources under constraints to obtain optimal performance in terms of job execution time, we propose a Genetic Algorithms (GAs) based optimization technique. Experimental results are provided to demonstrate the proposed framework's capability to successfully predict the job execution time of a given video analytic task based on infrastructure and input data related parameters and its ability determine the minimum job execution time, given constraints of these parameters. Given the above, the thesis contributes to the state-of-art in distributed video analytics, design, implementation, performance analysis and optimisation

Loughborough University Institutional Repository

D-SPACE4Cloud: A Design Tool for Big Data Applications

Author: A Aleti
A Castiglione
A Verma
E Vianna
ED Lazowska
F Brosig
HV Jagadish
K Kambatla
KH Lee
M Bertoli
M Malekimajd
M Tribastone
MR Garey
S Becker
W Zhang
Z Zhang
Publication venue
Publication date: 01/01/2016
Field of study

The last years have seen a steep rise in data generation worldwide, with the development and widespread adoption of several software projects targeting the Big Data paradigm. Many companies currently engage in Big Data analytics as part of their core business activities, nonetheless there are no tools and techniques to support the design of the underlying hardware configuration backing such systems. In particular, the focus in this report is set on Cloud deployed clusters, which represent a cost-effective alternative to on premises installations. We propose a novel tool implementing a battery of optimization and prediction techniques integrated so as to efficiently assess several alternative resource configurations, in order to determine the minimum cost cluster deployment satisfying QoS constraints. Further, the experimental campaign conducted on real systems shows the validity and relevance of the proposed method

arXiv.org e-Print Archive

Archivio istituzionale della ricerca - Politecnico di Milano

Crossref

Performance Improvement of Distributed Computing Framework and Scientific Big Data Analysis

Author: Kondikoppa Praveenkumar
Publication venue: LSU Digital Commons
Publication date: 01/01/2014
Field of study

Analysis of Big data to gain better insights has been the focus of researchers in the recent past. Traditional desktop computers or database management systems may not be suitable for efficient and timely analysis, due to the requirement of massive parallel processing. Distributed computing frameworks are being explored as a viable solution. For example, Google proposed MapReduce, which is becoming a de facto computing architecture for Big data solutions. However, scheduling in MapReduce is coarse grained and remains as a challenge for improvement. Related with MapReduce scheduler when configured over distributed clusters, we identify two issues: data locality disruption and random assignment of non-local map tasks. We propose a network aware scheduler to extend the existing rack awareness. The tasks are scheduled in the order of node, rack and any other rack within the same cluster to achieve cluster level data locality. The issue of random assignment non-local map tasks is handled by enhancing the scheduler to consider the network parameters, such as delay, bandwidth and packet loss between remote clusters. As part of Big data analysis at computational biology, we consider two major data intensive applications: indexing genome sequences and de Novo assembly. Both of these applications deal with the massive amount data generated from DNA sequencers. We developed a scalable algorithm to construct sub-trees of a suffix tree in parallel to address huge memory requirements needed for indexing the human genome. For the de Novo assembly, we propose Parallel Giraph based Assembler (PGA) to address the challenges associated with the assembly of large genomes over commodity hardware. PGA uses the de Bruijn graph to represent the data generated from sequencers. Huge memory demands and performance expectations are addressed by developing parallel algorithms based on the distributed graph-processing framework, Apache Giraph

Louisiana State University

Big Data in the Cloud: A Survey

Author: Jorge Bernardino
Pedro Caldeira Neves
Publication venue: RonPub
Publication date: 01/01/2015
Field of study

Big Data has become a hot topic across several business areas requiring the storage and processing of huge volumes of data. Cloud computing leverages Big Data by providing high storage and processing capabilities and enables corporations to consume resources in a pay-as-you-go model making clouds the optimal environment for storing and processing huge quantities of data. By using virtualized resources, Cloud can scale very easily, be highly available and provide massive storage capacity and processing power. This paper surveys existing databases models to store and process Big Data within a Cloud environment. Particularly, we detail the following traditional NoSQL databases: BigTable, Cassandra, DynamoDB, HBase, Hypertable, and MongoDB. The MapReduce framework and its developments Apache Spark, HaLoop, Twister, and other alternatives such as Apache Giraph, GraphLab, Pregel and MapD - a novel platform that uses GPU processing to accelerate Big Data processing - are also analyzed. Finally, we present two case studies that demonstrate the successful use of Big Data within Cloud environments and the challenges that must be addressed in the future

RonPub -- Research Online Publishing