Automatically Leveraging MapReduce Frameworks for Data-Intensive Applications
MapReduce is a popular programming paradigm for developing large-scale,
data-intensive computation. Many frameworks that implement this paradigm have
recently been developed. To leverage these frameworks, however, developers must
become familiar with their APIs and rewrite existing code. Casper is a new tool
that automatically translates sequential Java programs into the MapReduce
paradigm. Casper identifies potential code fragments to rewrite and translates
them in two steps: (1) Casper uses program synthesis to search for a program
summary (i.e., a functional specification) of each code fragment. The summary
is expressed using a high-level intermediate language resembling the MapReduce
paradigm and verified to be semantically equivalent to the original using a
theorem prover. (2) Casper generates executable code from the summary, using
either the Hadoop, Spark, or Flink API. We evaluated Casper by automatically
converting real-world, sequential Java benchmarks to MapReduce. The resulting
benchmarks perform up to 48.2x faster than the original.
Comment: 12 pages, additional 4 pages of references and appendi
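To make the idea of recasting a sequential fragment into the MapReduce paradigm concrete, here is a minimal sketch. It is not Casper output and does not use the Hadoop, Spark, or Flink APIs; it assumes plain java.util.stream as a stand-in, and the class and method names (WordCountSketch, sequential, mapReduce) are hypothetical. It shows a counting loop next to an equivalent map step followed by a reduce step, and checks that the two agree on one input.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical illustration only: not generated by Casper, and not tied to
// Hadoop, Spark, or Flink. java.util.stream stands in for a MapReduce runtime.
class WordCountSketch {

    // Sequential fragment of the kind a translator might target:
    // a loop that accumulates per-key counts into a map.
    static Map<String, Integer> sequential(List<String> words) {
        Map<String, Integer> counts = new HashMap<>();
        for (String w : words) {
            counts.put(w, counts.getOrDefault(w, 0) + 1);
        }
        return counts;
    }

    // The same computation as a map step (word -> (word, 1)) followed by
    // a reduce step (sum the values that share a key).
    static Map<String, Integer> mapReduce(List<String> words) {
        return words.stream()
                .map(w -> Map.entry(w, 1))
                .collect(Collectors.toMap(
                        Map.Entry::getKey, Map.Entry::getValue, Integer::sum));
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("a", "b", "a", "c", "b", "a");
        // Both forms should produce the same map for this input.
        System.out.println(sequential(words).equals(mapReduce(words)));
    }
}
```

Agreement on a single input is only a spot check. The abstract describes something stronger: equivalence between the summary and the original fragment verified with a theorem prover.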
GRAPE: Parallel Graph Query Engine
The need for graph computations is evident in a multitude of use cases. To support
computations on large-scale graphs, several parallel systems have been developed.
However, existing graph systems require users to recast algorithms into new models,
which makes parallel graph computation a privilege of experienced users only.
Moreover, real world applications often require much more complex graph processing
workflows than previously evaluated. In response to these challenges, the thesis
presents GRAPE, a distributed graph computation system, shipped with various applications
for social network analysis, social media marketing and functional dependencies
on graphs.
Firstly, the thesis presents the foundation of GRAPE. The principled approach of
GRAPE is based on partial evaluation and incremental computation. Sequential graph
algorithms can be plugged into GRAPE with minor changes, and get parallelized as a
whole. The termination and correctness are guaranteed under a monotonic condition.
Secondly, as an application on GRAPE, the thesis proposes graph-pattern association
rules (GPARs) for social media marketing. GPARs help users discover regularities
between entities in social graphs and identify potential customers by exploring social
influence. The thesis studies the problem of discovering top-k diversified GPARs and
the problem of identifying potential customers with GPARs. Although both are
NP-hard, parallel scalable algorithms on GRAPE are developed, which guarantee a polynomial
speedup over sequential algorithms as the number of processors increases.
Thirdly, the thesis proposes quantified graph patterns (QGPs), an extension of
graph patterns by supporting simple counting quantifiers on edges. QGPs naturally express
universal and existential quantification, numeric and ratio aggregates, as well as
negation. The thesis proves that the matching problem of QGPs remains NP-complete
in the absence of negation, and is DP-complete for general QGPs. In addition, the
thesis introduces quantified graph association rules defined with QGPs, to identify potential
customers in social media marketing.
Finally, to address the issue of data consistency, the thesis proposes a class of functional
dependencies for graphs, referred to as GFDs. GFDs capture both attribute-value
dependencies and topological structures of entities. The satisfiability and implication
problems for GFDs are studied and proved to be coNP-complete and NP-complete,
respectively. The thesis also proves that the validation problem for GFDs is
coNP-complete. The parallel algorithms developed on GRAPE verify that GFDs provide an
effective approach to detecting inconsistencies in knowledge and social graphs.
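The combination of partial evaluation and incremental computation described in the abstract above can be sketched as a single-process simulation. This is not GRAPE's API: the class and method names (FragmentFixpointSketch, localFixpoint, exchange, components) and the choice of connected components as the plugged-in sequential algorithm are assumptions made for illustration. Each fragment first runs the sequential algorithm on its local edges. Border values are then exchanged, and only fragments that received a smaller label are re-evaluated from their current state. Labels only ever decrease, which is one example of a monotonic condition under which the loop terminates.

```java
import java.util.Arrays;

// Single-process simulation; every name here is hypothetical, not GRAPE's API.
class FragmentFixpointSketch {

    // Sequential algorithm run inside one fragment: propagate the minimum
    // label along edges whose endpoints both lie in that fragment.
    static void localFixpoint(int frag, int[][] edges, int[] part, int[] label) {
        boolean changed = true;
        while (changed) {
            changed = false;
            for (int[] e : edges) {
                if (part[e[0]] != frag || part[e[1]] != frag) continue;
                int m = Math.min(label[e[0]], label[e[1]]);
                if (label[e[0]] != m || label[e[1]] != m) {
                    label[e[0]] = m;
                    label[e[1]] = m;
                    changed = true;
                }
            }
        }
    }

    // Exchange border values over cross-fragment edges and report which
    // fragments received a smaller label.
    static boolean[] exchange(int numFrags, int[][] edges, int[] part, int[] label) {
        boolean[] updated = new boolean[numFrags];
        for (int[] e : edges) {
            if (part[e[0]] == part[e[1]]) continue;
            int m = Math.min(label[e[0]], label[e[1]]);
            for (int v : e) {
                if (label[v] > m) {
                    label[v] = m;
                    updated[part[v]] = true;
                }
            }
        }
        return updated;
    }

    // Returns, for each node, the smallest node id in its connected component.
    static int[] components(int n, int numFrags, int[][] edges, int[] part) {
        int[] label = new int[n];
        for (int v = 0; v < n; v++) label[v] = v;
        // Partial evaluation: every fragment computes on its local data only.
        for (int f = 0; f < numFrags; f++) localFixpoint(f, edges, part, label);
        while (true) {
            boolean[] updated = exchange(numFrags, edges, part, label);
            boolean any = false;
            // Incremental step: only fragments with changed border values
            // resume, starting from the labels they already hold.
            for (int f = 0; f < numFrags; f++) {
                if (updated[f]) {
                    any = true;
                    localFixpoint(f, edges, part, label);
                }
            }
            if (!any) return label; // no fragment changed: global fixpoint
        }
    }

    public static void main(String[] args) {
        int[][] edges = {{0, 1}, {1, 2}, {2, 3}, {4, 5}};
        int[] part = {0, 0, 1, 1, 1, 1}; // node -> fragment id
        // prints [0, 0, 0, 0, 4, 4]
        System.out.println(Arrays.toString(components(6, 2, edges, part)));
    }
}
```

In this sketch the sequential routine is reused unchanged for both the initial and the incremental step; a real system would distribute fragments across workers and pass border values as messages.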
Big Data Computing for Geospatial Applications
The convergence of big data and geospatial computing has brought challenges and opportunities to Geographic Information Science with regard to geospatial data management, processing, analysis, modeling, and visualization. This book highlights recent advancements in integrating new computing approaches, spatial methods, and data management strategies to tackle geospatial big data challenges, while demonstrating opportunities for using big data in geospatial applications. Crucial to the advancements highlighted in this book are the integration of computational thinking and spatial thinking and the transformation of abstract ideas and models into concrete data structures and algorithms.
A Survey of Scholarly Data: From Big Data Perspective
Recently, organizations and governments have shifted their focus towards the digitization of academic and technical documents, adding a new facet to the concept of digital libraries. The volume, variety and velocity of the generated data satisfy the definition of big data, and as a result this scholarly reserve is popularly referred to as big scholarly data. To facilitate data analytics for big scholarly data, suitable architectures and services need to be developed. The evolving nature of research problems has made them essentially interdisciplinary. As a result, there is a growing demand for scholarly applications such as collaborator discovery, expert finding and research recommendation systems, among others. This paper investigates current trends and identifies existing challenges in the development of a big scholarly data platform, with a specific focus on directions for future research, and maps them to the different phases of the big data lifecycle.