A Survey on Geographically Distributed Big-Data Processing using MapReduce
Hadoop and Spark are widely used distributed processing frameworks for
large-scale data processing in an efficient and fault-tolerant manner on
private or public clouds. These big-data processing systems are extensively
used by many industries, e.g., Google, Facebook, and Amazon, for solving a
large class of problems, e.g., search, clustering, log analysis, different
types of join operations, matrix multiplication, pattern matching, and social
network analysis. However, all of these popular systems share a major limitation: they are designed for locally distributed computation, which prevents them from supporting geographically distributed data processing. The increasing amount of
geographically distributed massive data is pushing industries and academia to
rethink current big-data processing systems. Novel frameworks, going beyond the state-of-the-art architectures and technologies of current systems, are expected to process geographically distributed data in place, without moving entire raw datasets to a single location. In
this paper, we investigate and discuss challenges and requirements in designing
geographically distributed data processing frameworks and protocols. We
classify and study batch processing (MapReduce-based systems), stream
processing (Spark-based systems), and SQL-style processing geo-distributed
frameworks, models, and algorithms with their overhead issues.
Comment: IEEE Transactions on Big Data; Accepted June 2017. 20 pages.
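To make the design goal concrete, here is a minimal sketch (not from the survey; the site names and data are hypothetical) of the pattern most geo-distributed frameworks aim for: each site reduces its own raw data locally, and only the small per-site summaries travel to a coordinator for the final merge.

```python
from collections import Counter

# Hypothetical raw logs held at three geographically separate sites.
SITE_LOGS = {
    "us-east":  ["error", "ok", "ok", "error"],
    "eu-west":  ["ok", "ok", "ok"],
    "ap-south": ["error", "timeout"],
}

def local_reduce(records):
    """Runs at each site: summarize the raw data in place (count statuses)."""
    return Counter(records)

def global_merge(summaries):
    """Runs at the coordinator: merge the small per-site summaries; the raw
    datasets never leave their sites."""
    total = Counter()
    for summary in summaries:
        total.update(summary)
    return total

if __name__ == "__main__":
    per_site = {site: local_reduce(logs) for site, logs in SITE_LOGS.items()}
    print(global_merge(per_site.values()))
    # Counter({'ok': 5, 'error': 3, 'timeout': 1})
```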
knor: A NUMA-Optimized In-Memory, Distributed and Semi-External-Memory k-means Library
k-means is one of the most influential and utilized machine learning
algorithms. Its computation limits the performance and scalability of many
statistical analysis and machine learning tasks. We rethink and optimize
k-means in terms of modern NUMA architectures to develop a novel
parallelization scheme that delays and minimizes synchronization barriers. The
\textit{k-means NUMA Optimized Routine} (\textsf{knor}) library has (i)
in-memory (\textsf{knori}), (ii) distributed memory (\textsf{knord}), and (iii)
semi-external memory (\textsf{knors}) modules that radically improve the
performance of k-means for varying memory and hardware budgets. \textsf{knori}
boosts performance for single machine datasets by an order of magnitude or
more. \textsf{knors} improves the scalability of k-means on a memory budget
using SSDs. \textsf{knors} scales to billions of points on a single machine,
using a fraction of the resources that distributed in-memory systems require.
\textsf{knord} retains \textsf{knori}'s performance characteristics, while
scaling in-memory through distributed computation in the cloud. \textsf{knor}
modifies Elkan's triangle-inequality pruning algorithm so that it can be used on billion-point datasets without the significant memory overhead of the original algorithm. We demonstrate that \textsf{knor} outperforms distributed commercial products such as H2O, Turi (formerly Dato, GraphLab), and Spark's MLlib by more than an order of magnitude on datasets of up to billions of points.
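As a rough illustration of the triangle-inequality pruning mentioned above (a minimal NumPy sketch of Elkan's bound, not knor's memory-optimized variant), the assignment step below skips a distance computation whenever twice the point-to-assigned-centroid distance is no larger than the distance between the two centroids, which guarantees the other centroid cannot be closer:

```python
import numpy as np

def assign_with_pruning(points, centroids, assign, upper):
    """One assignment pass using the triangle-inequality bound from Elkan's
    algorithm: if 2 * d(x, c_assigned) <= d(c_assigned, c_other), then c_other
    cannot be closer to x, so d(x, c_other) never needs to be computed.
    `assign` holds each point's current centroid, `upper` an upper bound on
    each point's distance to that centroid."""
    center_dist = np.linalg.norm(centroids[:, None] - centroids[None, :], axis=2)
    skipped = 0
    for i, x in enumerate(points):
        a = assign[i]
        upper[i] = np.linalg.norm(x - centroids[a])      # tighten the bound
        for c in range(len(centroids)):
            if c == a:
                continue
            if 2.0 * upper[i] <= center_dist[a, c]:
                skipped += 1                              # pruned distance computation
                continue
            d = np.linalg.norm(x - centroids[c])
            if d < upper[i]:
                upper[i], assign[i], a = d, c, c
    return assign, upper, skipped

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pts = rng.normal(size=(1000, 16))
    cts = rng.normal(size=(8, 16))
    assign = np.argmin(np.linalg.norm(pts[:, None] - cts[None], axis=2), axis=1)
    upper = np.linalg.norm(pts - cts[assign], axis=1)
    _, _, skipped = assign_with_pruning(pts, cts, assign.copy(), upper.copy())
    print("distance computations skipped:", skipped)
```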
Performance Analysis and Efficient Execution on Systems with multi-core CPUs, GPUs and MICs
We carry out a comparative performance study of multi-core CPUs, GPUs and
Intel Xeon Phi (Many Integrated Core - MIC) with a microscopy image analysis
application. We experimentally evaluate the performance of computing devices on
core operations of the application. We correlate the observed performance with
the characteristics of computing devices and data access patterns, computation
complexities, and parallelization forms of the operations. The results show a
significant variability in the performance of operations with respect to the
device used. For operations with regular data access, performance on a MIC is comparable to, and sometimes better than, performance on a GPU. GPUs are more
efficient than MICs for operations that access data irregularly, because of the
lower bandwidth of the MIC for random data accesses. We propose new
performance-aware scheduling strategies that consider variabilities in
operation speedups. Our scheduling strategies significantly improve application
performance compared to classic strategies in hybrid configurations.
Comment: 22 pages, 12 figures, 6 tables.
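As a toy illustration of performance-aware scheduling under speedup variability (the operation names and per-device timings below are made up, not measurements from the paper), a greedy earliest-finish-time scheduler can place each operation on the device that would complete it soonest:

```python
# Illustrative per-device execution times (seconds) for each operation type;
# the variability across devices is the point, the numbers are invented.
OP_COST = {
    "threshold":  {"cpu": 4.0, "gpu": 1.0, "mic": 1.1},   # regular data access
    "morph_open": {"cpu": 6.0, "gpu": 1.5, "mic": 1.6},
    "watershed":  {"cpu": 8.0, "gpu": 2.0, "mic": 5.5},   # irregular data access
}

def schedule(tasks, devices):
    """Greedy earliest-finish-time scheduling: each operation is placed on the
    device that would complete it soonest, given that device's speed for the
    operation and how busy the device already is."""
    free_at = {d: 0.0 for d in devices}
    plan = []
    for op in tasks:
        dev = min(devices, key=lambda d: free_at[d] + OP_COST[op][d])
        free_at[dev] += OP_COST[op][dev]
        plan.append((op, dev, free_at[dev]))
    return plan, max(free_at.values())

if __name__ == "__main__":
    tasks = ["threshold", "watershed", "morph_open", "watershed", "threshold"]
    plan, makespan = schedule(tasks, ["cpu", "gpu", "mic"])
    for op, dev, finish in plan:
        print(f"{op:>10} -> {dev:3}  finishes at {finish:.1f}s")
    print("makespan:", makespan, "s")
```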
Semi-External Memory Sparse Matrix Multiplication for Billion-Node Graphs
Sparse matrix multiplication is traditionally performed in memory and scales
to large matrices using the distributed memory of multiple nodes. In contrast,
we scale sparse matrix multiplication beyond memory capacity by implementing
sparse matrix dense matrix multiplication (SpMM) in a semi-external memory
(SEM) fashion; i.e., we keep the sparse matrix on commodity SSDs and dense
matrices in memory. Our SEM-SpMM incorporates many in-memory optimizations for
large power-law graphs. It outperforms the in-memory implementations of
Trilinos and Intel MKL and scales to billion-node graphs, far beyond the
limitations of memory. Furthermore, on a single large parallel machine, our
SEM-SpMM operates as fast as the distributed implementations of Trilinos using
five times as much processing power. We also run our implementation in memory
(IM-SpMM) to quantify the overhead of keeping data on SSDs. SEM-SpMM achieves
almost 100% performance of IM-SpMM on graphs when the dense matrix has more
than four columns; it achieves at least 65% performance of IM-SpMM on all
inputs. We apply our SpMM to three important data analysis tasks--PageRank,
eigensolving, and non-negative matrix factorization--and show that our SEM
implementations significantly advance the state of the art.
Comment: published in IEEE Transactions on Parallel and Distributed Systems.
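A minimal sketch of the semi-external-memory access pattern described above, assuming the sparse matrix is paged in as row blocks (here simulated by slicing an in-memory SciPy matrix) while the dense matrix and the result stay resident in memory:

```python
import numpy as np
from scipy import sparse

def sem_spmm(row_blocks, dense, n_rows):
    """SpMM where the sparse matrix arrives as row blocks (e.g. paged in from
    SSD) while the dense matrix and the result stay resident in memory."""
    out = np.zeros((n_rows, dense.shape[1]))
    row = 0
    for block in row_blocks:                   # each block is a small CSR slice
        out[row:row + block.shape[0]] = block @ dense
        row += block.shape[0]
    return out

def blocks_from(A, block_rows):
    """Stand-in for reading row blocks from SSD: yield CSR row slices of A."""
    for start in range(0, A.shape[0], block_rows):
        yield A[start:start + block_rows]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = sparse.random(5000, 5000, density=1e-3, format="csr", random_state=0)
    X = rng.normal(size=(5000, 4))             # a few dense columns, as in SEM-SpMM
    Y = sem_spmm(blocks_from(A, 1000), X, A.shape[0])
    np.testing.assert_allclose(Y, A @ X)
    print("block-streamed SpMM matches the in-memory product")
```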
Hardware Acceleration for Unstructured Big Data and Natural Language Processing.
The confluence of the rapid growth in electronic data in recent years, and the renewed interest in domain-specific hardware accelerators presents exciting technical opportunities. Traditional scale-out solutions for processing the vast amounts of text data have been shown to be energy- and cost-inefficient. In contrast, custom hardware accelerators can provide higher throughputs, lower latencies, and significant energy savings. In this thesis, I present a set of hardware accelerators for unstructured big-data processing and natural language processing.
The first accelerator, called HAWK, aims to speed up the processing of ad hoc queries against large in-memory logs. HAWK is motivated by the observation that traditional software-based tools for processing large text corpora use memory bandwidth inefficiently due to software overheads, and, thus, fall far short of peak scan rates possible on modern memory systems. HAWK is designed to process data at a constant rate of 32 GB/s—faster than most extant memory systems. I demonstrate that HAWK outperforms state-of-the-art software solutions for text processing, almost by an order of magnitude in many cases. HAWK occupies an area of 45 sq-mm in its pareto-optimal configuration and consumes 22 W of power, well within the area and power envelopes of modern CPU chips.
The second accelerator I propose aims to speed up similarity measurement calculations for semantic search in the natural language processing space. By leveraging the latency hiding concepts of multi-threading and simple scheduling mechanisms, my design maximizes functional unit utilization. This similarity measurement accelerator provides speedups of 36x-42x over optimized software running on server-class cores, while requiring 56x-58x lower energy, and only 1.3% of the area.
PhD dissertation, Computer Science and Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/116712/1/prateekt_1.pd
Distributed Graphical Simulation in the Cloud
Graphical simulations are a cornerstone of modern media and films. But
existing software packages are designed to run on HPC nodes, and perform poorly
in the computing cloud. These simulations have complex access patterns over complex data structures and mutate data arbitrarily, so they are a poor fit for existing cloud computing systems. We describe a software architecture
for running graphical simulations in the cloud that decouples control logic,
computations, and data exchanges. This allows a central controller to balance load by redistributing computations and to recover from failures. Evaluations
show that the architecture can run existing, state-of-the-art simulations in
the presence of stragglers and failures, thereby enabling this large class of
applications to use the computing cloud for the first time.
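A toy sketch of the controller's rebalancing step, under the assumption (worker names and thresholds are illustrative, not from the paper) that the controller tracks per-worker progress and moves partitions off failed workers and stragglers:

```python
def rebalance(assignments, progress, failed, slow_threshold=0.5):
    """Toy controller step: move partitions off failed workers and off
    stragglers whose progress lags the healthy mean by more than
    `slow_threshold`. `assignments` maps worker -> list of partition ids,
    `progress` maps worker -> fraction of the current step completed."""
    healthy = [w for w in assignments if w not in failed]
    mean = sum(progress[w] for w in healthy) / len(healthy)
    stragglers = {w for w in healthy if progress[w] < mean * slow_threshold}
    donors = set(failed) | stragglers
    receivers = sorted(set(healthy) - stragglers,
                       key=lambda w: len(assignments[w]))
    for w in donors:
        while assignments[w]:
            # always give the next partition to the least-loaded healthy worker
            target = min(receivers, key=lambda r: len(assignments[r]))
            assignments[target].append(assignments[w].pop())
    return assignments

if __name__ == "__main__":
    assignments = {"w1": [0, 1, 2], "w2": [3, 4], "w3": [5, 6, 7]}
    progress = {"w1": 0.9, "w2": 0.1, "w3": 0.8}   # w2 is a straggler
    print(rebalance(assignments, progress, failed={"w3"}))
```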
Big Geospatial Data processing in the IQmulus Cloud
Remote sensing instruments are continuously evolving in terms of spatial, spectral, and temporal resolution and hence provide exponentially increasing amounts of raw data; these volumes grow significantly faster than computing speeds. These techniques record vast amounts of data, yet in different data models and representations, so the resulting datasets require harmonization and integration before meaningful information can be derived from them. All in all, huge datasets are available, but raw data is of almost no value if it is not processed, semantically enriched, and quality checked. The derived information then needs to be transferred and published to all levels of potential users, from decision makers to citizens. Up to now, there are only limited automatic procedures for this; thus, a wealth of information lies latent in many datasets. This paper presents the first achievements of the IQmulus EU FP7 research and development project with respect to processing and analysis of big geospatial data in the context of flood and waterlogging detection.
The Family of MapReduce and Large Scale Data Processing Systems
In the last two decades, the continuous increase of computational power has
produced an overwhelming flow of data which has called for a paradigm shift in
the computing architecture and large scale data processing mechanisms.
MapReduce is a simple and powerful programming model that enables easy
development of scalable parallel applications to process vast amounts of data
on large clusters of commodity machines. It isolates the application from the
details of running a distributed program such as issues on data distribution,
scheduling and fault tolerance. However, the original implementation of the
MapReduce framework had some limitations that have been tackled by many
research efforts in several follow-up works after its introduction. This article provides a comprehensive survey of a family of approaches and mechanisms for large-scale data processing that have been implemented based on the original idea of the MapReduce framework and are currently gaining momentum in both the research and industrial communities. We also cover a set of systems that provide declarative programming interfaces on top of the MapReduce framework. In addition, we
review several large scale data processing systems that resemble some of the
ideas of the MapReduce framework for different purposes and application
scenarios. Finally, we discuss some of the future research directions for
implementing the next generation of MapReduce-like solutions.
Comment: arXiv admin note: text overlap with arXiv:1105.4252 by other authors.
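To make the programming model concrete, here is the canonical word-count example with a toy single-process driver standing in for the framework; in a real MapReduce system the framework, not the user, would handle data distribution, scheduling, and fault tolerance:

```python
from collections import defaultdict
from itertools import chain

# User code: only the map and reduce functions, as in the MapReduce model.
def map_fn(line):
    for word in line.split():
        yield word.lower(), 1

def reduce_fn(word, counts):
    yield word, sum(counts)

def run_mapreduce(records, map_fn, reduce_fn):
    """Toy single-process driver standing in for the framework."""
    groups = defaultdict(list)
    for key, value in chain.from_iterable(map_fn(r) for r in records):
        groups[key].append(value)                       # shuffle / group by key
    return dict(chain.from_iterable(reduce_fn(k, v) for k, v in groups.items()))

if __name__ == "__main__":
    lines = ["the quick brown fox", "the lazy dog", "The fox"]
    print(run_mapreduce(lines, map_fn, reduce_fn))
    # {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}
```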
Muppet: MapReduce-Style Processing of Fast Data
MapReduce has emerged as a popular method to process big data. In the past
few years, however, not just big data, but fast data has also exploded in
volume and availability. Examples of such data include sensor data streams, the
Twitter Firehose, and Facebook updates. Numerous applications must process fast
data. Can we provide a MapReduce-style framework so that developers can quickly
write such applications and execute them over a cluster of machines, to achieve
low latency and high scalability? In this paper we report on our investigation
of this question, as carried out at Kosmix and WalmartLabs. We describe
MapUpdate, a framework like MapReduce, but specifically developed for fast
data. We describe Muppet, our implementation of MapUpdate. Throughout the
description we highlight the key challenges, argue why MapReduce is not well
suited to address them, and briefly describe our current solutions. Finally, we
describe our experience and lessons learned with Muppet, which has been used
extensively at Kosmix and WalmartLabs to power a broad range of applications in
social media and e-commerce.
Comment: VLDB 2012.
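The abstract does not spell out the MapUpdate API, so the following is only a minimal sketch, under stated assumptions, of a map/update-style streaming pattern: a map step emits keyed events and an update step folds each event into per-key state, so results are available with low latency as the stream flows.

```python
from collections import defaultdict

def map_event(event):
    """Map step: turn one incoming event into (key, value) pairs.
    Here: hypothetical social-media events keyed by hashtag."""
    for tag in event["tags"]:
        yield tag, 1

def update(state, value):
    """Update step: fold a value into the per-key state (a running count)."""
    return state + value

def process_stream(events):
    state = defaultdict(int)     # per-key state kept across the whole stream
    for event in events:
        for key, value in map_event(event):
            state[key] = update(state[key], value)
        yield dict(state)        # low-latency view after every event

if __name__ == "__main__":
    stream = [
        {"user": "a", "tags": ["sale", "deals"]},
        {"user": "b", "tags": ["sale"]},
        {"user": "c", "tags": ["deals", "sale"]},
    ]
    for snapshot in process_stream(stream):
        print(snapshot)
```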
EasyFJP: Providing Hybrid Parallelism as a Concern for Divide and Conquer Java Applications
Because of the increasing availability of multi-core machines, clusters, Grids, and combinations of these, there is now plenty of computational power, but today's programmers are not fully prepared to exploit parallelism. In particular, Java has helped in handling the heterogeneity of such environments. However, there is a lot of ground to cover regarding facilities for easily and elegantly parallelizing applications. One path to this end seems to be the synthesis of semi-automatic parallelism and Parallelism as a Concern (PaaC). The former allows users to be mostly unaware of parallel exploitation problems and, at the same time, to manually optimize parallelized applications whenever necessary, while the latter keeps applications separate from parallelism-related code. In this paper, we present EasyFJP, an approach that implicitly exploits parallelism in Java applications based on the fork-join synchronization pattern, a simple but effective abstraction for creating and coordinating parallel tasks. In addition, EasyFJP lets users explicitly optimize applications through policies, or user-provided rules to dynamically regulate task granularity. Finally, EasyFJP relies on PaaC by means of source-code generation techniques to wire applications and parallel-specific code together. Experiments with real-world applications on an emulated Grid and a cluster show that EasyFJP delivers competitive performance compared with state-of-the-art Java parallel programming tools.
Authors: Cristian Maximiliano Mateos Diaz, Alejandro Octavio Zunino Suarez, and Matías Eberardo Hirsch Jofré. Consejo Nacional de Investigaciones Científicas y Técnicas, Centro Científico Tecnológico CONICET Tandil, Instituto Superior de Ingenieria del Software, Argentina.
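EasyFJP itself targets Java and generates the parallel code for you; purely as an illustration of the fork-join pattern plus a user-supplied granularity policy (the threshold values are arbitrary), a hand-written sketch might look like this:

```python
import threading

def fork_policy(data, depth):
    """User-supplied policy (EasyFJP's 'policies' idea): fork only while the
    slice is large enough and the fork depth is shallow. Thresholds are
    illustrative, not taken from the paper."""
    return len(data) > 10_000 and depth < 3

def parallel_sum_squares(data, depth=0):
    """Fork-join divide and conquer: fork the left half into its own thread,
    compute the right half in the current thread, then join and combine.
    (With CPython's GIL this shows the structure, not a real speedup.)"""
    if not fork_policy(data, depth):
        return sum(x * x for x in data)                 # sequential base case
    mid = len(data) // 2
    result = {}
    left = threading.Thread(
        target=lambda: result.update(left=parallel_sum_squares(data[:mid], depth + 1)))
    left.start()                                        # fork
    right = parallel_sum_squares(data[mid:], depth + 1)
    left.join()                                         # join
    return result["left"] + right

if __name__ == "__main__":
    data = list(range(200_000))
    assert parallel_sum_squares(data) == sum(x * x for x in data)
    print("fork-join result matches the sequential result")
```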