1,567 research outputs found
GraphX: Unifying Data-Parallel and Graph-Parallel Analytics
From social networks to language modeling, the growing scale and importance
of graph data has driven the development of numerous new graph-parallel systems
(e.g., Pregel, GraphLab). By restricting the computation that can be expressed
and introducing new techniques to partition and distribute the graph, these
systems can efficiently execute iterative graph algorithms orders of magnitude
faster than more general data-parallel systems. However, the same restrictions
that enable the performance gains also make it difficult to express many of the
important stages in a typical graph-analytics pipeline: constructing the graph,
modifying its structure, or expressing computation that spans multiple graphs.
As a consequence, existing graph analytics pipelines compose graph-parallel and
data-parallel systems using external storage systems, leading to extensive data
movement and a complicated programming model.
To address these challenges we introduce GraphX, a distributed graph
computation framework that unifies graph-parallel and data-parallel
computation. GraphX provides a small, core set of graph-parallel operators
expressive enough to implement the Pregel and PowerGraph abstractions, yet
simple enough to be cast in relational algebra. GraphX uses a collection of
query optimization techniques such as automatic join rewrites to efficiently
implement these graph-parallel operators. We evaluate GraphX on real-world
graphs and workloads and demonstrate that GraphX achieves performance
comparable to specialized graph computation systems, while outperforming them
in end-to-end graph pipelines. Moreover, GraphX achieves a balance between
expressiveness, performance, and ease of use.
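The idea that a graph-parallel step can be cast in relational terms can be illustrated with a minimal sketch. It is plain Python, not GraphX's API: the names `aggregate_messages` and `connected_components` are hypothetical, the "join" is a dictionary lookup of each edge's source vertex, and the "group-by" folds messages per destination vertex.

```python
def aggregate_messages(vertices, edges, send, merge):
    """One graph-parallel step expressed relationally (illustrative only).

    vertices: dict mapping vertex id -> property
    edges:    list of (src, dst) pairs
    send:     function from the source property to a message
    merge:    associative, commutative function combining two messages
    """
    inbox = {}
    for src, dst in edges:
        # Join: look up the source vertex property for this edge.
        message = send(vertices[src])
        # Group-by destination: fold messages addressed to the same vertex.
        inbox[dst] = merge(inbox[dst], message) if dst in inbox else message
    return inbox


def connected_components(vertex_ids, edges):
    """Label each vertex with the smallest vertex id in its component."""
    labels = {v: v for v in vertex_ids}
    # Treat the graph as undirected by adding the reverse of every edge.
    undirected = list(edges) + [(dst, src) for src, dst in edges]
    while True:
        inbox = aggregate_messages(labels, undirected, send=lambda x: x, merge=min)
        updated = {v: min(label, inbox.get(v, label)) for v, label in labels.items()}
        if updated == labels:  # fixpoint reached, no label changed
            return labels
        labels = updated
```

Calling `connected_components([1, 2, 3, 4, 5], [(1, 2), (2, 3), (4, 5)])` iterates the step until no label changes. A distributed system would partition both tables and optimize the join; this sketch only shows the shape of the computation.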
Incremental Processing and Optimization of Update Streams
In recent years, we have seen an increasing number of applications in networking, sensor networks, cloud computing, and environmental monitoring, which monitor, plan, control, and make decisions over data streams from multiple sources. We are interested in extending traditional stream processing techniques to meet the new challenges of these applications. Generally, in order to support genuine continuous query optimization and processing over data streams, we need to systematically understand how to address incremental optimization and processing of update streams for a rich class of queries commonly used in the applications.
Our general thesis is that efficient incremental processing and re-optimization of update streams can be achieved with incremental view maintenance techniques if we cast the problems as incremental view maintenance problems over data streams. We focus on two challenges in the incremental processing of update streams that existing work on stream query processing does not address: incremental processing of transitive closure queries over data streams, and incremental re-optimization of queries. In addition to addressing these specific challenges, we also develop a working prototype system, Aspen, an end-to-end stream processing system that has been deployed as the foundation for a case study of our SmartCIS application. We validate our solutions both analytically and empirically on top of Aspen, over a variety of benchmark workloads such as the TPC-H and LinearRoad benchmarks.
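To make the notion of incrementally maintaining a transitive closure view concrete, here is a minimal sketch. It assumes an insert-only edge stream; deletions need extra bookkeeping and are not handled. The class name `IncrementalClosure` is hypothetical and the code is not taken from Aspen.

```python
class IncrementalClosure:
    """Maintains the transitive closure of a growing edge set.

    Illustrative sketch: handles edge insertions only.
    """

    def __init__(self):
        self.pairs = set()  # materialized view: all (a, b) with a path a -> b

    def insert_edge(self, u, v):
        """Apply one inserted edge and return only the newly derived pairs."""
        # Every node that already reaches u (plus u itself) ...
        sources = {u} | {a for a, b in self.pairs if b == u}
        # ... now also reaches every node reachable from v (plus v itself).
        targets = {v} | {b for a, b in self.pairs if a == v}
        delta = {(a, b) for a in sources for b in targets} - self.pairs
        self.pairs |= delta
        return delta
```

For example, after inserting the edges (1, 2) and (3, 4), inserting (2, 3) yields the delta {(1, 3), (1, 4), (2, 3), (2, 4)} without recomputing the closure from scratch.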
Towards a Holistic Integration of Spreadsheets with Databases: A Scalable Storage Engine for Presentational Data Management
Spreadsheet software is the tool of choice for interactive ad-hoc data
management, with adoption by billions of users. However, spreadsheets are not
scalable, unlike database systems. On the other hand, database systems, while
highly scalable, do not support interactivity as a first-class primitive. We
are developing DataSpread to holistically integrate spreadsheets as a
front-end interface with databases as a back-end datastore, providing
scalability to spreadsheets, and interactivity to databases, an integration we
term presentational data management (PDM). In this paper, we make a first step
towards this vision: developing a storage engine for PDM, studying how to
flexibly represent spreadsheet data within a database and how to support and
maintain access by position. We first conduct an extensive survey of
spreadsheet use to motivate our functional requirements for a storage engine
for PDM. We develop a natural set of mechanisms for flexibly representing
spreadsheet data and demonstrate that identifying the optimal representation is
NP-Hard; however, we develop an efficient approach to identify the optimal
representation from an important and intuitive subclass of representations. We
extend our mechanisms with positional access mechanisms that don't suffer from
cascading update issues, leading to constant time access and modification
performance. We evaluate these representations on a workload of typical
spreadsheets and spreadsheet operations, observing up to a 20% reduction in
storage and up to a 50% reduction in formula evaluation time.
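The cascading update issue arises when rows are stored under their position: inserting a row then forces every later row to be renumbered. A minimal sketch of one way to avoid this is shown below, assuming an indirection layer from position to a stable row identifier. The class `PositionalSheet` is hypothetical and is not DataSpread's mechanism; the Python list used for the ordering still shifts entries on insert, so a real engine would need a more efficient positional index.

```python
class PositionalSheet:
    """Rows are stored under stable ids; positions live in a separate mapping."""

    def __init__(self):
        self._rows = {}    # stable row id -> list of cell values
        self._order = []   # position (0-based) -> stable row id
        self._next_id = 0

    def insert_row(self, position, cells):
        """Insert a row at a position without rewriting any stored row key."""
        row_id = self._next_id
        self._next_id += 1
        self._rows[row_id] = list(cells)
        self._order.insert(position, row_id)  # only the mapping changes
        return row_id

    def delete_row(self, position):
        """Remove the row currently at the given position."""
        row_id = self._order.pop(position)
        del self._rows[row_id]

    def get_row(self, position):
        """Fetch the row currently displayed at the given position."""
        return self._rows[self._order[position]]
```

Inserting a row in the middle changes only the position mapping. The stored rows keep their identifiers, so no stored key has to be renumbered.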
Database system architecture supporting coexisting query languages and data models
SIGLE / LD:D48239/84 / BLDSC - British Library Document Supply Centre / GB / United Kingdom
Provenance in Collaborative Data Sharing
This dissertation focuses on recording, maintaining and exploiting provenance information in Collaborative Data Sharing Systems (CDSS). These are systems that support data sharing across loosely-coupled, heterogeneous collections of relational databases related by declarative schema mappings. A fundamental challenge in a CDSS is to support the capability of update exchange --- which publishes a participant's updates and then translates others' updates to the participant's local schema and imports them --- while tolerating disagreement between them and recording the provenance of exchanged data, i.e., information about the sources and mappings involved in their propagation. This provenance information can be useful during update exchange, e.g., to evaluate provenance-based trust policies. It can also be exploited after update exchange, to answer a variety of user queries about the quality, uncertainty or authority of the data, for applications such as trust assessment, ranking for keyword search over databases, or query answering in probabilistic databases.
To address these challenges, in this dissertation we develop a novel model of provenance graphs that is informative enough to satisfy the needs of CDSS users and captures the semantics of query answering on various forms of annotated relations. We extend techniques from data integration, data exchange, incremental view maintenance and view update to define the formal semantics of unidirectional and bidirectional update exchange. We develop algorithms to perform update exchange incrementally while maintaining provenance information. We present strategies for implementing our techniques over an RDBMS and experimentally demonstrate their viability in the Orchestra prototype system. We define ProQL, a query language for provenance graphs that can be used by CDSS users to combine data querying with provenance testing as well as to compute annotations for their data, based on their provenance, that are useful for a variety of applications. Finally, we develop a prototype implementation of ProQL over an RDBMS, along with indexing techniques to speed up provenance querying, and experimentally evaluate the performance of provenance querying and the benefits of our indexing techniques.
The XFM view adaptation mechanism: An essential component for XML data warehouses
In the past few years, with many organisations providing web services for business and communication purposes, large volumes of XML transactions take place on a daily basis.
In many cases, organisations maintain these transactions in their native XML format due to its flexibility for exchanging data between heterogeneous systems. This XML data
provides an important resource for decision support systems. As a consequence, XML technology has slowly been included within decision support systems of data warehouse
systems. The problem encountered is that existing native XML database systems suffer from poor performance in terms of managing data volume and response time for complex
analytical queries. Although materialised XML views can be used to improve the performance for XML data warehouses, update problems then become the bottleneck of using
materialised views. Specifically, synchronising materialised views in the face of changing view definitions remains a significant issue. In this dissertation, we provide a method for XML-based data warehouses to manage updates caused by changes to view definitions (view redefinitions), which is referred to as the view adaptation problem. In our approach, views are defined using XPath and then modelled using a set of novel algebraic operators and fragments. XPath views are integrated into a single view graph called the XML Fragment
Materialisation (XFM) View Graph, where common parts between different views are shared and appear only once in the graph. Fragments within the view graph can be selected
for materialisation to facilitate the view adaptation process. While changes are applied, our view adaptation algorithms can quickly determine what part of the XFM view graph is affected.
The adaptation algorithms then perform a structural adaptation to update the view graph, followed by data adaptation to update materialised fragments.
Scalable Automated Incrementalization for Real-Time Static Analyses
This thesis proposes a framework for easy development of static analyses, whose results are incrementalized to provide instantaneous feedback in an integrated development environment (IDE).
Today, IDEs feature many tools that have static analyses as their foundation to assess software quality and catch correctness problems.
Yet, these tools often fail to provide instantaneous feedback and are thus restricted to nightly build processes. This precludes developers from fixing issues at their inception time, i.e., when the problem and the developed solution are both still fresh in mind.
Incrementalization is a well-known technique for providing instantaneous feedback: it exploits the fact that developers make only small changes to the code and, hence, analysis results can be re-computed quickly based on these changes. Yet, incrementalization requires carefully crafted static analyses, which makes a manual approach unattractive. Automated incrementalization can alleviate these problems and allows analysis writers to formulate their analyses as queries with the full data set in mind, without worrying about the semantics of incremental changes.
Existing approaches to automated incrementalization utilize standard technologies, such as deductive databases, that provide declarative query languages, yet also require materializing the full dataset in main memory, i.e., the memory is permanently blocked by the data required for the analyses. Other standard technologies such as relational databases offer better scalability due to persistence, yet require large transaction times for data. Neither technology is a perfect match for integrating static analyses into an IDE, since the underlying data, i.e., the code base, is already persisted and managed by the IDE. Hence, transitioning the data into a database is redundant work.
In this thesis a novel approach is proposed that provides a declarative query language and automated incrementalization, yet retains in memory only a necessary minimum of data, i.e., only the data that is required for the incrementalization. The approach allows static analyses to be declared as incrementally maintained views, where the underlying formalism for incrementalization is the relational algebra with extensions for object-orientation and recursion. The algebra makes it possible to deduce which data is the necessary minimum for incremental maintenance and indeed shows that many views are self-maintainable, i.e., do not require any data to be materialized in memory at all. In addition, an optimization for the algebra is proposed that widens the range of self-maintainable views, based on domain knowledge of the underlying data. The optimization works similarly to declaring primary keys for databases, i.e., the optimization is declared on the schema of the data, and defines which data is incrementally maintained in the same scope. The scope makes all analyses (views) that correlate only data within the boundaries of the scope self-maintainable.
The approach is implemented as an embedded domain specific language in a general-purpose programming language. The implementation can be understood as a database-like engine with an SQL-style query language and the execution semantics of the relational algebra. As such the system is a general purpose database-like query engine and can be used to incrementalize domains other than static analyses. To evaluate the approach, a large variety of static analyses were sampled from real-world tools and formulated as incrementally maintained views in the implemented engine.
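A self-maintainable view can be sketched with a selection operator: each added or removed element is tested on its own, so the view never needs to store the base data. The sketch below is illustrative only and is not the thesis's engine or query language. The names `Method`, `SelectionView`, and `update_warnings`, and the "more than five parameters" rule, are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Method:
    name: str
    parameter_count: int


class SelectionView:
    """A filter view that forwards deltas and keeps no copy of the base data."""

    def __init__(self, predicate):
        self._predicate = predicate
        self._observers = []

    def subscribe(self, callback):
        """Register a callback invoked as callback(kind, item) for each delta."""
        self._observers.append(callback)

    def added(self, item):
        # The decision depends only on the changed item itself.
        if self._predicate(item):
            for notify in self._observers:
                notify("added", item)

    def removed(self, item):
        if self._predicate(item):
            for notify in self._observers:
                notify("removed", item)


# A toy analysis: warn about methods with long parameter lists.
warnings = set()


def update_warnings(kind, method):
    if kind == "added":
        warnings.add(method.name)
    else:
        warnings.discard(method.name)


long_parameter_lists = SelectionView(lambda m: m.parameter_count > 5)
long_parameter_lists.subscribe(update_warnings)
long_parameter_lists.added(Method("render", 7))  # matches the predicate
long_parameter_lists.added(Method("close", 0))   # filtered out, nothing stored
```

Only the consumer of the view (here the `warnings` set) holds state. Operators such as joins generally need auxiliary data to process a delta, which is why not every view is self-maintainable.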