A Comparison of Big Data Frameworks on a Layered Dataflow Model
In the world of Big Data analytics, a growing number of tools aim to
simplify the programming of applications executed on clusters. Although each
tool claims to provide better programming, data, and execution models, for which
only informal (and often confusing) semantics are generally provided, all share
a common underlying model, namely, the Dataflow model. The Dataflow model we
propose shows how these tools share the same expressiveness at different
levels of abstraction. The contribution of this work is twofold: first, we show
that the proposed model is (at least) as general as existing batch and
streaming frameworks (e.g., Spark, Flink, Storm), thus making it easier to
understand high-level data-processing applications written in such frameworks.
Second, we provide a layered model that can represent tools and applications
following the Dataflow paradigm and we show how the analyzed tools fit in each
level.

Comment: 19 pages, 6 figures, 2 tables. In Proc. of the 9th Intl Symposium on
High-Level Parallel Programming and Applications (HLPP), July 4-5, 2016,
Münster, Germany.
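The layered Dataflow idea can be made concrete with a small sketch (ours, not the paper's code): a pipeline is a directed graph of operators connected by channels, and at this level of abstraction the same graph describes both batch and streaming programs in frameworks such as Spark, Flink, or Storm. The `Node` class and the word-count-style pipeline below are illustrative names, not an API from the paper.

```python
# A dataflow program as a DAG of operators connected by channels.
# Items are pushed through the graph; each operator transforms its
# input and forwards the result downstream.

class Node:
    def __init__(self, name, fn):
        self.name, self.fn = name, fn
        self.downstream = []

    def to(self, other):
        """Connect this operator to a downstream operator; returns it for chaining."""
        self.downstream.append(other)
        return other

    def push(self, item):
        out = self.fn(item)
        if out is not None:
            for nxt in self.downstream:
                nxt.push(out)

# A small pipeline expressed as a dataflow graph:
# source -> tokenize -> count -> sink
results = []
source = Node("source", lambda x: x)
tokenize = Node("tokenize", lambda line: line.split())
count = Node("count", lambda words: len(words))
sink = Node("sink", lambda n: results.append(n))

source.to(tokenize).to(count).to(sink)

for line in ["big data frameworks", "dataflow model"]:
    source.push(line)

print(results)  # → [3, 2]
```

Whether the channels carry a finite dataset (batch) or an unbounded stream is a property of the execution level, not of the graph itself, which is the sense in which the tools share expressiveness.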
MapReduce is Good Enough? If All You Have is a Hammer, Throw Away Everything That's Not a Nail!
Hadoop is currently the large-scale data analysis "hammer" of choice, but
there exist classes of algorithms that aren't "nails", in the sense that they
are not particularly amenable to the MapReduce programming model. To address
this, researchers have proposed MapReduce extensions or alternative programming
models in which these algorithms can be elegantly expressed. This essay
espouses a very different position: that MapReduce is "good enough", and that
instead of trying to invent screwdrivers, we should simply get rid of
everything that's not a nail. To be more specific, much discussion in the
literature surrounds the fact that iterative algorithms are a poor fit for
MapReduce: the simple solution is to find alternative non-iterative algorithms
that solve the same problem. This essay captures my personal experiences as an
academic researcher as well as a software engineer in a "real-world" production
analytics environment. From this combined perspective I reflect on the current
state and future of "big data" research.
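The "nails" the essay refers to are computations that decompose naturally into a map phase, a shuffle that groups intermediate values by key, and a reduce phase. A minimal local simulation of that model (our illustration, not code from the essay) is the classic word count:

```python
# The MapReduce programming model, simulated locally:
# map emits (key, value) pairs, shuffle groups values by key,
# reduce aggregates each group.
from collections import defaultdict

def map_phase(docs):
    for doc in docs:
        for word in doc.split():
            yield word, 1

def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    return {key: sum(values) for key, values in groups.items()}

docs = ["a hammer is a hammer", "not a nail"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts)  # → {'a': 3, 'hammer': 2, 'is': 1, 'not': 1, 'nail': 1}
```

Iterative algorithms fit this model poorly because each iteration becomes a full MapReduce job with its shuffle and materialization costs, which is exactly why the essay suggests seeking non-iterative alternatives instead of new frameworks.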
Optimality Properties, Distributed Strategies, and Measurement-Based Evaluation of Coordinated Multicell OFDMA Transmission
The throughput of multicell systems is inherently limited by interference and
the available communication resources. Coordinated resource allocation is the
key to efficient performance, but the demand on backhaul signaling and
computational resources grows rapidly with the number of cells, terminals, and
subcarriers. To handle this, we propose a novel multicell framework with
dynamic cooperation clusters where each terminal is jointly served by a small
set of base stations. Each base station coordinates interference to neighboring
terminals only, thus limiting backhaul signaling and making the framework
scalable. This framework can describe anything from interference channels to
ideal joint multicell transmission.
The resource allocation (i.e., precoding and scheduling) is formulated as an
optimization problem (P1) with performance described by arbitrary monotonic
functions of the signal-to-interference-and-noise ratios (SINRs) and arbitrary
linear power constraints. Although (P1) is non-convex and difficult to solve
optimally, we are able to prove: 1) Optimality of single-stream beamforming; 2)
Conditions for full power usage; and 3) A precoding parametrization based on a
few parameters between zero and one. These optimality properties are used to
propose low-complexity strategies: both a centralized scheme and a distributed
version that only requires local channel knowledge and processing. We evaluate
the performance on measured multicell channels and observe that the proposed
strategies achieve close-to-optimal performance among centralized and
distributed solutions, respectively. In addition, we show that multicell
interference coordination can give substantial improvements in sum performance,
but that joint transmission is very sensitive to synchronization errors and
that some terminals can experience performance degradations.

Comment: Published in IEEE Transactions on Signal Processing, 15 pages, 7
figures. This version corrects typos related to Eq. (4) and Eq. (28).
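For concreteness, the standard single-stream SINR expression for terminal $k$ (our simplified notation; the paper's model additionally restricts each terminal to its dynamic cooperation cluster) with beamforming vectors $\mathbf{w}_i$, channel $\mathbf{h}_k$, and noise power $\sigma_k^2$ is:

```latex
\mathrm{SINR}_k = \frac{\left|\mathbf{h}_k^{H}\mathbf{w}_k\right|^2}
                       {\sigma_k^2 + \sum_{i \neq k} \left|\mathbf{h}_k^{H}\mathbf{w}_i\right|^2}
```

The performance metric in (P1) is then an arbitrary function $f(\mathrm{SINR}_1, \ldots, \mathrm{SINR}_K)$ that is monotonically increasing in each argument, which is what makes the optimality properties (single-stream beamforming, full-power conditions, low-dimensional precoding parametrization) applicable across many utility choices.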
MODIS-HIRIS ground data systems commonality report
The High Resolution Imaging Spectrometer (HIRIS) and Moderate Resolution Imaging Spectrometer (MODIS) Data Systems Working Group was formed in September 1988 with representatives of the MODIS Data System Study Group and the HIRIS Project Data System Design Group to collaborate in the development of requirements on the EosDIS necessary to meet the science objectives of the two facility instruments. A major objective was to identify and promote commonality between the HIRIS and MODIS data systems, especially from the science users' point of view. A goal was to provide a base set of joint requirements and specifications which could easily be expanded to a Phase-B representation of the needs of the science users of all EOS instruments. This document describes the points of commonality and difference between the Level-II Requirements, Operations Concepts, and Systems Specifications for the ground data systems for the MODIS and HIRIS instruments at their present state of development.
GraphMP: An Efficient Semi-External-Memory Big Graph Processing System on a Single Machine
Recent studies showed that single-machine graph processing systems can be
highly competitive with cluster-based approaches on large-scale problems. While
several out-of-core graph processing systems and computation models have been
proposed, the high disk I/O overhead could significantly reduce performance in
many practical cases. In this paper, we propose GraphMP to tackle big graph
analytics on a single machine. GraphMP achieves low disk I/O overhead with
three techniques. First, we design a vertex-centric sliding window (VSW)
computation model to avoid reading and writing vertices on disk. Second, we
propose a selective scheduling method to skip loading and processing
unnecessary edge shards on disk. Third, we use a compressed edge cache
mechanism to fully utilize the available memory of a machine to reduce the
amount of disk accesses for edges. Extensive evaluations have shown that
GraphMP could outperform state-of-the-art systems such as GraphChi, X-Stream
and GridGraph by 31.6x, 54.5x and 23.1x respectively, when running popular
graph applications on a billion-vertex graph.
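The core of the vertex-centric sliding window idea can be sketched as follows (a rough illustration of the concept under our own assumptions, not GraphMP's actual code): vertex state stays resident in memory, while edges are streamed shard by shard, so only edge data incurs disk I/O. Here the "shards" are in-memory lists and the update rule is a PageRank-style computation; in GraphMP the shards would be read from disk.

```python
# Vertex-centric sliding window, sketched: vertex arrays (rank,
# out_degree) never leave memory; the window slides over edge shards,
# which are the only data that would be read from disk.

def process(num_vertices, shards, iterations=2, damping=0.85):
    rank = [1.0 / num_vertices] * num_vertices
    out_degree = [0] * num_vertices
    for shard in shards:
        for src, _ in shard:
            out_degree[src] += 1

    for _ in range(iterations):
        incoming = [0.0] * num_vertices
        # Slide over the edge shards; each edge contributes its source's
        # rank share to its destination.
        for shard in shards:            # in GraphMP: streamed from disk
            for src, dst in shard:
                incoming[dst] += rank[src] / out_degree[src]
        rank = [(1 - damping) / num_vertices + damping * x
                for x in incoming]
    return rank

# Tiny 3-vertex graph split into two edge shards.
shards = [[(0, 1), (0, 2)], [(1, 2), (2, 0)]]
ranks = process(3, shards)
print(ranks)
```

Selective scheduling then amounts to skipping a shard entirely when none of its source vertices changed in the previous iteration, and the compressed edge cache keeps recently used shards in spare memory to avoid re-reading them.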