Search CORE

17 research outputs found

Actors vs Shared Memory: two models at work on Big Data application frameworks

Author: Crafa Silvia
Tronchin Luca
Publication venue
Publication date: 01/01/2015
Field of study

This work aims at analyzing how two different concurrency models, namely the shared memory model and the actor model, can influence the development of applications that manage huge masses of data, distinctive of Big Data applications. The paper compares the two models by analyzing a couple of concrete projects based on the MapReduce and Bulk Synchronous Parallel algorithmic schemes. Both projects are doubly implemented on two concrete platforms: Akka Cluster and Managed X10. The result is both a conceptual comparison of models in the Big Data Analytics scenario, and an experimental analysis based on concrete executions on a cluster platform

arXiv.org e-Print Archive

CiteSeerX

Archivio istituzionale della ricerca - Università di Padova

Scheduling MapReduce Jobs and Data Shuffle on Unrelated Processors

Author: Dimitris Fotakis
Emmanouil Zampetakis
Georgios Zois
Ioannis Milis
Orestis Papadigenopoulos
Publication venue
Publication date: 24/06/2014
Field of study

We propose constant approximation algorithms for generalizations of the Flexible Flow Shop (FFS) problem which form a realistic model for non-preemptive scheduling in MapReduce systems. Our results concern the minimization of the total weighted completion time of a set of MapReduce jobs on unrelated processors and improve substantially on the model proposed by Moseley et al. (SPAA 2011) in two directions. First, we consider each job consisting of multiple Map and Reduce tasks, as this is the key idea behind MapReduce computations, and we propose a constant approximation algorithm. Then, we introduce into our model the crucial cost of data shuffle phase, i.e., the cost for the transmission of intermediate data from Map to Reduce tasks. In fact, we model this phase by an additional set of Shuffle tasks for each job and we manage to keep the same approximation ratio when they are scheduled on the same processors with the corresponding Reduce tasks and to provide also a constant ratio when they are scheduled on different processors. This is the most general setting of the FFS problem (with a special third stage) for which a constant approximation ratio is known

arXiv.org e-Print Archive

Crossref

HAL Descartes

Hal-Diderot

Big Data Management Challenges, Approaches, Tools and their limitations

Author: Adiba Michel
Castrejon-Castillo Juan-Carlos
Espinosa Oviedo Javier Alfonso
Vargas-Solar Genoveva
Zechinelli-Martini José-Luis
Publication venue: Chapman and Hall/CRC
Publication date: 01/02/2016
Field of study

International audienceBig Data is the buzzword everyone talks about. Independently of the application domain, today there is a consensus about the V's characterizing Big Data: Volume, Variety, and Velocity. By focusing on Data Management issues and past experiences in the area of databases systems, this chapter examines the main challenges involved in the three V's of Big Data. Then it reviews the main characteristics of existing solutions for addressing each of the V's (e.g., NoSQL, parallel RDBMS, stream data management systems and complex event processing systems). Finally, it provides a classification of different functions offered by NewSQL systems and discusses their benefits and limitations for processing Big Data

Hal - Université Grenoble Alpes

Communication Steps for Parallel Query Processing

Author: Beame Paul
Koutris Paraschos
Suciu Dan
Publication venue
Publication date: 01/01/2013
Field of study

We consider the problem of computing a relational query

q

on a large input database of size

n

, using a large number

p

of servers. The computation is performed in rounds, and each server can receive only

O(n/p^{1-\varepsilon})

bits of data, where

\varepsilon \in [0,1]

is a parameter that controls replication. We examine how many global communication steps are needed to compute

q

. We establish both lower and upper bounds, in two settings. For a single round of communication, we give lower bounds in the strongest possible model, where arbitrary bits may be exchanged; we show that any algorithm requires

\varepsilon \geq 1-1/\tau^*

, where

\tau^*

is the fractional vertex cover of the hypergraph of

q

. We also give an algorithm that matches the lower bound for a specific class of databases. For multiple rounds of communication, we present lower bounds in a model where routing decisions for a tuple are tuple-based. We show that for the class of tree-like queries there exists a tradeoff between the number of rounds and the space exponent

\varepsilon

. The lower bounds for multiple rounds are the first of their kind. Our results also imply that transitive closure cannot be computed in O(1) rounds of communication

arXiv.org e-Print Archive

CiteSeerX

Crossref

Modelli di programmazione scalabile per Big Data: analisi comparativa e sperimentale

Author: Berti Matteo
Publication venue: Alma Mater Studiorum - Università di Bologna
Publication date: 16/10/2018
Field of study

Analisi di alcuni modelli di programmazione scalabile per Big Data, presentando i principali framework in base al tipo di elaborazione: batch, stream, ibrida. Infine sperimentazione e confronti tra Apache Hadoop e Apache Spark, sia in modalità single-node che su cluster AWS

AMS Tesi di Laurea