Actors vs Shared Memory: two models at work on Big Data application frameworks
This work analyzes how two different concurrency models, the shared memory model and the actor model, influence the development of applications that manage the huge volumes of data characteristic of Big Data applications. The paper compares the two models through two concrete projects based on the MapReduce and Bulk Synchronous Parallel algorithmic schemes. Each project is implemented twice, once on each of two concrete platforms: Akka Cluster and Managed X10. The result is both a conceptual comparison of the models in the Big Data Analytics scenario and an experimental analysis based on concrete executions on a cluster platform.
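To make the contrast concrete, here is a minimal sketch of the actor style in classic Akka (Scala). The SumActor and its Add message are illustrative placeholders, not code from the paper's projects: each actor confines its mutable state and communicates only by messages, where a shared-memory version would guard the same counter with a lock or an atomic.

```scala
import akka.actor.{Actor, ActorSystem, Props}

// Actors share nothing and communicate only through immutable messages.
final case class Add(values: Seq[Int])

class SumActor extends Actor {
  private var total = 0L // mutable, but confined to this actor: no locks needed

  def receive: Receive = {
    case Add(vs) =>
      total += vs.map(_.toLong).sum
      println(s"running total = $total")
  }
}

object ActorDemo extends App {
  val system = ActorSystem("demo")
  val summer = system.actorOf(Props[SumActor], "summer")
  // In an Akka Cluster deployment the actor could live on a remote node;
  // the message-passing code below would stay the same.
  (1 to 4).foreach(i => summer ! Add(Seq.fill(1000)(i)))
  system.terminate()
}
```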
Scheduling MapReduce Jobs and Data Shuffle on Unrelated Processors
We propose constant approximation algorithms for generalizations of the Flexible Flow Shop (FFS) problem, which form a realistic model for non-preemptive scheduling in MapReduce systems. Our results concern the minimization of the total weighted completion time of a set of MapReduce jobs on unrelated processors and improve substantially on the model proposed by Moseley et al. (SPAA 2011) in two directions. First, we consider each job as consisting of multiple Map and Reduce tasks, as this is the key idea behind MapReduce computations, and we propose a constant approximation algorithm. Then, we introduce into our model the crucial cost of the data shuffle phase, i.e., the cost of transmitting intermediate data from Map to Reduce tasks. Specifically, we model this phase by an additional set of Shuffle tasks for each job; we keep the same approximation ratio when they are scheduled on the same processors as the corresponding Reduce tasks, and we also provide a constant ratio when they are scheduled on different processors. This is the most general setting of the FFS problem (with a special third stage) for which a constant approximation ratio is known.
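As a toy illustration of the objective being minimized, the plain-Scala sketch below schedules the Map, Shuffle, and Reduce stages of each job on unrelated processors and evaluates the total weighted completion time sum_j w_j * C_j. The processing times are made up and the greedy assignment rule is only for illustration; the paper's algorithms are LP-based approximations, not this heuristic.

```scala
// A task of some job; timeOn(p) is its processing time on processor p
// (unrelated machines: times may differ arbitrarily across processors).
final case class Task(jobId: Int, timeOn: Vector[Double])

object WeightedCompletionDemo extends App {
  // Schedule one stage greedily: each task goes to the processor where it
  // would finish earliest, never starting before the job's previous stage ended.
  def scheduleStage(tasks: Seq[Task], ready: Map[Int, Double], nProcs: Int): Map[Int, Double] = {
    val free = Array.fill(nProcs)(0.0) // next idle time of each processor
    var done = Map.empty[Int, Double]  // per job: finish time of its last task in this stage
    for (t <- tasks) {
      val (p, finish) = (0 until nProcs)
        .map(p => p -> (math.max(free(p), ready.getOrElse(t.jobId, 0.0)) + t.timeOn(p)))
        .minBy(_._2)
      free(p) = finish
      done += t.jobId -> math.max(done.getOrElse(t.jobId, 0.0), finish)
    }
    done
  }

  val weights  = Map(1 -> 3.0, 2 -> 1.0)
  val maps     = Seq(Task(1, Vector(2.0, 4.0)), Task(1, Vector(1.0, 1.0)), Task(2, Vector(3.0, 2.0)))
  val shuffles = Seq(Task(1, Vector(1.0, 2.0)), Task(2, Vector(2.0, 1.0)))
  val reduces  = Seq(Task(1, Vector(2.0, 2.0)), Task(2, Vector(1.0, 3.0)))

  val afterMap     = scheduleStage(maps, Map.empty, 2)
  val afterShuffle = scheduleStage(shuffles, afterMap, 2)
  val c            = scheduleStage(reduces, afterShuffle, 2) // C_j of each job
  val objective    = weights.map { case (j, w) => w * c(j) }.sum
  println(s"total weighted completion time = $objective")
}
```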
Big Data Management Challenges, Approaches, Tools and their Limitations
Big Data is the buzzword everyone talks about. Independently of the application domain, there is today a consensus about the V's characterizing Big Data: Volume, Variety, and Velocity. Focusing on data management issues and on past experience in the area of database systems, this chapter examines the main challenges involved in the three V's of Big Data. It then reviews the main characteristics of existing solutions for addressing each of the V's (e.g., NoSQL, parallel RDBMS, stream data management systems, and complex event processing systems). Finally, it provides a classification of the different functions offered by NewSQL systems and discusses their benefits and limitations for processing Big Data.
Communication Steps for Parallel Query Processing
We consider the problem of computing a relational query q on a large input database of size n, using a large number p of servers. The computation is performed in rounds, and each server can receive only O(n/p^(1-ε)) bits of data, where ε ∈ [0,1] is a parameter that controls replication. We examine how many global communication steps are needed to compute q. We establish both lower and upper bounds, in two settings. For a single round of communication, we give lower bounds in the strongest possible model, where arbitrary bits may be exchanged; we show that any algorithm requires ε ≥ 1 - 1/τ*(q), where τ*(q) is the fractional vertex cover of the hypergraph of q. We also give an algorithm that matches the lower bound for a specific class of databases. For multiple rounds of communication, we present lower bounds in a model where routing decisions for a tuple are tuple-based. We show that for the class of tree-like queries there exists a tradeoff between the number of rounds and the space exponent ε. The lower bounds for multiple rounds are the first of their kind. Our results also imply that transitive closure cannot be computed in O(1) rounds of communication.
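For intuition, the sketch below simulates the simplest one-round instance: a binary join R(a,b) ⋈ S(b,c) computed by hashing every tuple on b to one of p servers, so no replication is needed (space exponent 0). The relations and the hash function are made up; the HyperCube-style routing analyzed in this line of work generalizes this idea to arbitrary conjunctive queries.

```scala
object OneRoundJoin extends App {
  val p = 4                                      // number of servers
  val R = Seq((1, 10), (2, 10), (3, 20))         // R(a, b), made-up tuples
  val S = Seq((10, 100), (20, 200), (30, 300))   // S(b, c), made-up tuples

  // Communication round: route every tuple to server h(b). A join on a
  // single shared variable needs each tuple sent to exactly one server.
  def h(b: Int): Int = b % p // toy hash; inputs here are non-negative
  val rAt = R.groupBy { case (_, b) => h(b) }
  val sAt = S.groupBy { case (b, _) => h(b) }

  // Local computation: each server joins only the fragments it received.
  val result = (0 until p).flatMap { srv =>
    for {
      (a, b)  <- rAt.getOrElse(srv, Nil)
      (b2, c) <- sAt.getOrElse(srv, Nil)
      if b == b2
    } yield (a, b, c)
  }
  println(result.mkString(", ")) // (1,10,100), (2,10,100), (3,20,200)
}
```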
Scalable programming models for Big Data: a comparative and experimental analysis
An analysis of several scalable programming models for Big Data, presenting the main frameworks according to processing type: batch, stream, and hybrid. Finally, experiments and comparisons between Apache Hadoop and Apache Spark, both in single-node mode and on an AWS cluster.
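To give a flavor of the code compared in such experiments, below is a minimal Spark batch job in Scala: word count, the canonical Hadoop-vs-Spark benchmark. The input/output paths and app name are placeholders, since the thesis's actual workloads are not given in the abstract.

```scala
import org.apache.spark.sql.SparkSession

object WordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("wordcount-demo").getOrCreate()
    val counts = spark.sparkContext
      .textFile("hdfs:///input/corpus.txt")   // placeholder input path
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)                     // in-memory shuffle, vs. Hadoop MapReduce's disk-based one
    counts.saveAsTextFile("hdfs:///output/word-counts") // placeholder output path
    spark.stop()
  }
}
```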