29 research outputs found
Pregelix: Big(ger) Graph Analytics on A Dataflow Engine
There is a growing need for distributed graph processing systems that are
capable of gracefully scaling to very large graph datasets. Unfortunately, this
challenge has not been easily met due to the intense memory pressure imposed by
process-centric, message passing designs that many graph processing systems
follow. Pregelix is a new open source distributed graph processing system that
is based on an iterative dataflow design that is better tuned to handle both
in-memory and out-of-core workloads. As such, Pregelix offers improved
performance characteristics and scaling properties over current open source
systems (e.g., we have seen up to 15x speedup compared to Apache Giraph and up
to 35x speedup compared to distributed GraphLab), and makes more effective use
of available machine resources to support Big(ger) Graph Analytics
Types of Scientific Collaborators: A Perspective of Author Contribution Network
The purpose of this study is to investigate interaction between collaborators within individual studies by measuring how they made contributions to their studies. Author contribution network is constructed based on the author contribution statements of 140,000 full-text articles in PloS by viewing every collaborator as a node and a shared contribution as an edge. Three types of contributors are identified: general team-players, factotums, and mavericks. The preliminary result suggests that division of labor widely exists in scientific re-search and the latter two types of collaborators are common in small teams
Iterative MapReduce for Large Scale Machine Learning
Large datasets ("Big Data") are becoming ubiquitous because the potential
value in deriving insights from data, across a wide range of business and
scientific applications, is increasingly recognized. In particular, machine
learning - one of the foundational disciplines for data analysis, summarization
and inference - on Big Data has become routine at most organizations that
operate large clouds, usually based on systems such as Hadoop that support the
MapReduce programming paradigm. It is now widely recognized that while
MapReduce is highly scalable, it suffers from a critical weakness for machine
learning: it does not support iteration. Consequently, one has to program
around this limitation, leading to fragile, inefficient code. Further, reliance
on the programmer is inherently flawed in a multi-tenanted cloud environment,
since the programmer does not have visibility into the state of the system when
his or her program executes. Prior work has sought to address this problem by
either developing specialized systems aimed at stylized applications, or by
augmenting MapReduce with ad hoc support for saving state across iterations
(driven by an external loop). In this paper, we advocate support for looping as
a first-class construct, and propose an extension of the MapReduce programming
paradigm called {\em Iterative MapReduce}. We then develop an optimizer for a
class of Iterative MapReduce programs that cover most machine learning
techniques, provide theoretical justifications for the key optimization steps,
and empirically demonstrate that system-optimized programs for significant
machine learning tasks are competitive with state-of-the-art specialized
solutions
LDM: Lineage-Aware Data Management in Multi-tier Storage Systems
We design and develop LDM, a novel data management solution to cater the needs of applications exhibiting the lineage property, i.e. in which the current writes are future reads. In such a class of applications, slow writes significantly hurt the over-all performance of jobs, i.e. current writes determine the fate of next reads. We believe that in a large scale shared production cluster, the issues associated due to data management can be mitigated at a way higher layer in the hierarchy of the I/O path, even before requests to data access are made. Contrary to the current solutions to data management which are mostly reactive and/or based on heuristics, LDM is both deterministic and pro-active. We develop block-graphs, which enable LDM to capture the complete time-based data-task dependency associations, therefore use it to perform life-cycle management through tiering of data blocks. LDM amalgamates the information from the entire data center ecosystem, right from the application code, to file system mappings, the compute and storage devices topology, etc. to make oracle-like deterministic data management decisions. With trace-driven experiments, LDM is able to achieve 29–52% reduction in over-all data center workload execution time. Moreover, by deploying LDM with extensive pre-processing creates efficient data consumption pipelines, which also reduces write and read delays significantly
AsterixDB: A Scalable, Open Source BDMS
AsterixDB is a new, full-function BDMS (Big Data Management System) with a
feature set that distinguishes it from other platforms in today's open source
Big Data ecosystem. Its features make it well-suited to applications like web
data warehousing, social data storage and analysis, and other use cases related
to Big Data. AsterixDB has a flexible NoSQL style data model; a query language
that supports a wide range of queries; a scalable runtime; partitioned,
LSM-based data storage and indexing (including B+-tree, R-tree, and text
indexes); support for external as well as natively stored data; a rich set of
built-in types; support for fuzzy, spatial, and temporal types and queries; a
built-in notion of data feeds for ingestion of data; and transaction support
akin to that of a NoSQL store.
Development of AsterixDB began in 2009 and led to a mid-2013 initial open
source release. This paper is the first complete description of the resulting
open source AsterixDB system. Covered herein are the system's data model, its
query language, and its software architecture. Also included are a summary of
the current status of the project and a first glimpse into how AsterixDB
performs when compared to alternative technologies, including a parallel
relational DBMS, a popular NoSQL store, and a popular Hadoop-based SQL data
analytics platform, for things that both technologies can do. Also included is
a brief description of some initial trials that the system has undergone and
the lessons learned (and plans laid) based on those early "customer"
engagements
On Software Infrastructure for Scalable Graph Analytics
Recently, there is a growing need for distributed graph processing systems that are capable of gracefully scaling to very large datasets. In the mean time, in real-world applications, it is highly desirable to reduce the tedious, inefficient ETL (extract, transform, load) gap between tabular data processing systems and graph processing systems. Unfortunately, those challenges have not been easily met due to the intense memory pressure imposed by process-centric, message passing designs that many graph processing systems follow, as well as the separation of tabular data processing runtimes and graph processing runtimes.In this thesis, we explore the application of programming techniques and algorithms from the database systems world to the problem of scalable graph analysis. We first propose a bloat-aware design paradigm towards the development of efficient and scalable Big Data applications in object-oriented, GC enabled languages and demonstrate that programming under this paradigm does not incur significant programming burden but obtains remarkable performance gains (e.g., 2.5X). Based on the design paradigm, we then build Pregelix, an open source distributed graph processing system which is based on an iterative dataflow design that is better tuned to handle both in-memory and out-of-core workloads. As such, Pregelix offers improved performance characteristics and scaling properties over current open source systems (e.g., we have seen up to 15X speedup compared to Apache Giraph and up to 35X speedup compared to distributed GraphLab). Finally, we integrate Pregelix with the open source Big Data management system AsterixDB to offer users a mix of a vertex-oriented programming model and a declarative query language for richer forms of Big Graph analytics with reduced ETL pains
Big graph analytics platforms
A comprehensive survey that clearly summarizes the key features and techniques developed in existing big graph systems. It aims to help readers get a systematic picture of the landscape of recent big graph systems, focusing not just on the systems themselves, but also on the key innovations and design philosophies underlying them
Combined Static and Dynamic Automated Test Generation
In an object-oriented program, a unit test often consists of a sequence of method calls that create and mutate objects, then use them as arguments to a method under test. It is challenging to automatically generate sequences that are legal and behaviorally-diverse, that is, reaching as many different program states as possible. This paper proposes a combined static and dynamic automated test generation approach to address these problems, for code without a formal specification. Our approach first uses dynamic analysis to infer a call sequence model from a sample execution, then uses static analysis to identify method dependence relations based on the fields they may read or write. Finally, both the dynamicallyinferred model (which tends to be accurate but incomplete) and the statically-identified dependence information (which tends to be conservative) guide a random test generator to create legal and behaviorally-diverse tests. Our Palus tool implements this testing approach. We compared its effectiveness with a pure random approach, a dynamic-random approach (without a static phase), and a static-random approach (without a dynamic phase) on several popular open-source Java programs. Tests generated by Palus achieved higher structural coverage and found more bugs. Palus is also internally used in Google. It has found 22 previouslyunknown bugs in four well-tested Google products