A Comparison of Big Data Frameworks on a Layered Dataflow Model
In the world of Big Data analytics, a growing number of tools aim to
simplify the programming of applications executed on clusters. Although each
tool claims to provide better programming, data, and execution models, for which
only informal (and often confusing) semantics are generally provided, all share
a common underlying model, namely, the Dataflow model. The Dataflow model we
propose shows how these tools share the same expressiveness at different
levels of abstraction. The contribution of this work is twofold: first, we show
that the proposed model is (at least) as general as existing batch and
streaming frameworks (e.g., Spark, Flink, Storm), thus making it easier to
understand high-level data-processing applications written in such frameworks.
Second, we provide a layered model that can represent tools and applications
following the Dataflow paradigm and we show how the analyzed tools fit in each
level.

Comment: 19 pages, 6 figures, 2 tables. In Proc. of the 9th Intl Symposium on
High-Level Parallel Programming and Applications (HLPP), July 4-5, 2016,
Münster, Germany.
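The layered Dataflow idea can be made concrete with a small sketch (ours, not the paper's code): a pipeline is a directed graph of operators connected by channels, and at this level of abstraction the same graph describes both batch and streaming programs in frameworks such as Spark, Flink, or Storm. The `Node` class and the word-count-style pipeline below are illustrative names, not an API from the paper.

```python
# A dataflow program as a DAG of operators connected by channels.
# Items are pushed through the graph; each operator transforms its
# input and forwards the result downstream.

class Node:
    def __init__(self, name, fn):
        self.name, self.fn = name, fn
        self.downstream = []

    def to(self, other):
        """Connect this operator to a downstream operator; returns it for chaining."""
        self.downstream.append(other)
        return other

    def push(self, item):
        out = self.fn(item)
        if out is not None:
            for nxt in self.downstream:
                nxt.push(out)

# A small pipeline expressed as a dataflow graph:
# source -> tokenize -> count -> sink
results = []
source = Node("source", lambda x: x)
tokenize = Node("tokenize", lambda line: line.split())
count = Node("count", lambda words: len(words))
sink = Node("sink", lambda n: results.append(n))

source.to(tokenize).to(count).to(sink)

for line in ["big data frameworks", "dataflow model"]:
    source.push(line)

print(results)  # → [3, 2]
```

Whether the channels carry a finite dataset (batch) or an unbounded stream is a property of the execution level, not of the graph itself, which is the sense in which the tools share expressiveness.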
MapReduce is Good Enough? If All You Have is a Hammer, Throw Away Everything That's Not a Nail!
Hadoop is currently the large-scale data analysis "hammer" of choice, but
there exist classes of algorithms that aren't "nails", in the sense that they
are not particularly amenable to the MapReduce programming model. To address
this, researchers have proposed MapReduce extensions or alternative programming
models in which these algorithms can be elegantly expressed. This essay
espouses a very different position: that MapReduce is "good enough", and that
instead of trying to invent screwdrivers, we should simply get rid of
everything that's not a nail. To be more specific, much discussion in the
literature surrounds the fact that iterative algorithms are a poor fit for
MapReduce: the simple solution is to find alternative non-iterative algorithms
that solve the same problem. This essay captures my personal experiences as an
academic researcher as well as a software engineer in a "real-world" production
analytics environment. From this combined perspective I reflect on the current
state and future of "big data" research.
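The "nails" the essay refers to are computations that decompose naturally into a map phase, a shuffle that groups intermediate values by key, and a reduce phase. A minimal local simulation of that model (our illustration, not code from the essay) is the classic word count:

```python
# The MapReduce programming model, simulated locally:
# map emits (key, value) pairs, shuffle groups values by key,
# reduce aggregates each group.
from collections import defaultdict

def map_phase(docs):
    for doc in docs:
        for word in doc.split():
            yield word, 1

def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    return {key: sum(values) for key, values in groups.items()}

docs = ["a hammer is a hammer", "not a nail"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts)  # → {'a': 3, 'hammer': 2, 'is': 1, 'not': 1, 'nail': 1}
```

Iterative algorithms fit this model poorly because each iteration becomes a full MapReduce job with its shuffle and materialization costs, which is exactly why the essay suggests seeking non-iterative alternatives instead of new frameworks.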
Optimality Properties, Distributed Strategies, and Measurement-Based Evaluation of Coordinated Multicell OFDMA Transmission
The throughput of multicell systems is inherently limited by interference and
the available communication resources. Coordinated resource allocation is the
key to efficient performance, but the demand on backhaul signaling and
computational resources grows rapidly with the number of cells, terminals, and
subcarriers. To handle this, we propose a novel multicell framework with
dynamic cooperation clusters where each terminal is jointly served by a small
set of base stations. Each base station coordinates interference to neighboring
terminals only, thus limiting backhaul signaling and making the framework
scalable. This framework can describe anything from interference channels to
ideal joint multicell transmission.
The resource allocation (i.e., precoding and scheduling) is formulated as an
optimization problem (P1) with performance described by arbitrary monotonic
functions of the signal-to-interference-and-noise ratios (SINRs) and arbitrary
linear power constraints. Although (P1) is non-convex and difficult to solve
optimally, we are able to prove: 1) Optimality of single-stream beamforming; 2)
Conditions for full power usage; and 3) A precoding parametrization based on a
few parameters between zero and one. These optimality properties are used to
propose low-complexity strategies: both a centralized scheme and a distributed
version that only requires local channel knowledge and processing. We evaluate
the performance on measured multicell channels and observe that the proposed
strategies achieve close-to-optimal performance among centralized and
distributed solutions, respectively. In addition, we show that multicell
interference coordination can give substantial improvements in sum performance,
but that joint transmission is very sensitive to synchronization errors and
that some terminals can experience performance degradations.

Comment: Published in IEEE Transactions on Signal Processing, 15 pages, 7
figures. This version corrects typos related to Eq. (4) and Eq. (28).
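For concreteness, the standard single-stream SINR expression for terminal $k$ (our simplified notation; the paper's model additionally restricts each terminal to its dynamic cooperation cluster) with beamforming vectors $\mathbf{w}_i$, channel $\mathbf{h}_k$, and noise power $\sigma_k^2$ is:

```latex
\mathrm{SINR}_k = \frac{\left|\mathbf{h}_k^{H}\mathbf{w}_k\right|^2}
                       {\sigma_k^2 + \sum_{i \neq k} \left|\mathbf{h}_k^{H}\mathbf{w}_i\right|^2}
```

The performance metric in (P1) is then an arbitrary function $f(\mathrm{SINR}_1, \ldots, \mathrm{SINR}_K)$ that is monotonically increasing in each argument, which is what makes the optimality properties (single-stream beamforming, full-power conditions, low-dimensional precoding parametrization) applicable across many utility choices.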
MODIS-HIRIS ground data systems commonality report
The High Resolution Imaging Spectrometer (HIRIS) and Moderate Resolution Imaging Spectrometer (MODIS) Data Systems Working Group was formed in September 1988 with representatives of the MODIS Data System Study Group and the HIRIS Project Data System Design Group to collaborate in the development of requirements on the EosDIS necessary to meet the science objectives of the two facility instruments. A major objective was to identify and promote commonality between the HIRIS and MODIS data systems, especially from the science users' point of view. A goal was to provide a base set of joint requirements and specifications which could easily be expanded to a Phase-B representation of the needs of the science users of all EOS instruments. This document describes the points of commonality and difference between the Level-II Requirements, Operations Concepts, and Systems Specifications for the ground data systems for the MODIS and HIRIS instruments at their present state of development.
GraphMP: An Efficient Semi-External-Memory Big Graph Processing System on a Single Machine
Recent studies showed that single-machine graph processing systems can be
highly competitive with cluster-based approaches on large-scale problems. While
several out-of-core graph processing systems and computation models have been
proposed, the high disk I/O overhead could significantly reduce performance in
many practical cases. In this paper, we propose GraphMP to tackle big graph
analytics on a single machine. GraphMP achieves low disk I/O overhead with
three techniques. First, we design a vertex-centric sliding window (VSW)
computation model to avoid reading and writing vertices on disk. Second, we
propose a selective scheduling method to skip loading and processing
unnecessary edge shards on disk. Third, we use a compressed edge cache
mechanism to fully utilize the available memory of a machine to reduce the
amount of disk accesses for edges. Extensive evaluations have shown that
GraphMP could outperform state-of-the-art systems such as GraphChi, X-Stream
and GridGraph by 31.6x, 54.5x and 23.1x respectively, when running popular
graph applications on a billion-vertex graph.
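The core of the vertex-centric sliding window idea can be sketched as follows (a rough illustration of the concept under our own assumptions, not GraphMP's actual code): vertex state stays resident in memory, while edges are streamed shard by shard, so only edge data incurs disk I/O. Here the "shards" are in-memory lists and the update rule is a PageRank-style computation; in GraphMP the shards would be read from disk.

```python
# Vertex-centric sliding window, sketched: vertex arrays (rank,
# out_degree) never leave memory; the window slides over edge shards,
# which are the only data that would be read from disk.

def process(num_vertices, shards, iterations=2, damping=0.85):
    rank = [1.0 / num_vertices] * num_vertices
    out_degree = [0] * num_vertices
    for shard in shards:
        for src, _ in shard:
            out_degree[src] += 1

    for _ in range(iterations):
        incoming = [0.0] * num_vertices
        # Slide over the edge shards; each edge contributes its source's
        # rank share to its destination.
        for shard in shards:            # in GraphMP: streamed from disk
            for src, dst in shard:
                incoming[dst] += rank[src] / out_degree[src]
        rank = [(1 - damping) / num_vertices + damping * x
                for x in incoming]
    return rank

# Tiny 3-vertex graph split into two edge shards.
shards = [[(0, 1), (0, 2)], [(1, 2), (2, 0)]]
ranks = process(3, shards)
print(ranks)
```

Selective scheduling then amounts to skipping a shard entirely when none of its source vertices changed in the previous iteration, and the compressed edge cache keeps recently used shards in spare memory to avoid re-reading them.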