1,567 research outputs found

    GraphX: Unifying Data-Parallel and Graph-Parallel Analytics

    Full text link
    From social networks to language modeling, the growing scale and importance of graph data has driven the development of numerous new graph-parallel systems (e.g., Pregel, GraphLab). By restricting the computation that can be expressed and introducing new techniques to partition and distribute the graph, these systems can efficiently execute iterative graph algorithms orders of magnitude faster than more general data-parallel systems. However, the same restrictions that enable the performance gains also make it difficult to express many of the important stages in a typical graph-analytics pipeline: constructing the graph, modifying its structure, or expressing computation that spans multiple graphs. As a consequence, existing graph analytics pipelines compose graph-parallel and data-parallel systems using external storage systems, leading to extensive data movement and complicated programming model. To address these challenges we introduce GraphX, a distributed graph computation framework that unifies graph-parallel and data-parallel computation. GraphX provides a small, core set of graph-parallel operators expressive enough to implement the Pregel and PowerGraph abstractions, yet simple enough to be cast in relational algebra. GraphX uses a collection of query optimization techniques such as automatic join rewrites to efficiently implement these graph-parallel operators. We evaluate GraphX on real-world graphs and workloads and demonstrate that GraphX achieves comparable performance as specialized graph computation systems, while outperforming them in end-to-end graph pipelines. Moreover, GraphX achieves a balance between expressiveness, performance, and ease of use

    Incremental Processing and Optimization of Update Streams

    Get PDF
    Over the recent years, we have seen an increasing number of applications in networking, sensor networks, cloud computing, and environmental monitoring, which monitor, plan, control, and make decisions over data streams from multiple sources. We are interested in extending traditional stream processing techniques to meet the new challenges of these applications. Generally, in order to support genuine continuous query optimization and processing over data streams, we need to systematically understand how to address incremental optimization and processing of update streams for a rich class of queries commonly used in the applications. Our general thesis is that efficient incremental processing and re-optimization of update streams can be achieved by various incremental view maintenance techniques if we cast the problems as incremental view maintenance problems over data streams. We focus on two incremental processing of update streams challenges currently not addressed in existing work on stream query processing: incremental processing of transitive closure queries over data streams, and incremental re-optimization of queries. In addition to addressing these specific challenges, we also develop a working prototype system Aspen, which serves as an end-to-end stream processing system that has been deployed as the foundation for a case study of our SmartCIS application. We validate our solutions both analytically and empirically on top of our prototype system Aspen, over a variety of benchmark workloads such as TPC-H and LinearRoad Benchmarks

    Towards a Holistic Integration of Spreadsheets with Databases: A Scalable Storage Engine for Presentational Data Management

    Full text link
    Spreadsheet software is the tool of choice for interactive ad-hoc data management, with adoption by billions of users. However, spreadsheets are not scalable, unlike database systems. On the other hand, database systems, while highly scalable, do not support interactivity as a first-class primitive. We are developing DataSpread, to holistically integrate spreadsheets as a front-end interface with databases as a back-end datastore, providing scalability to spreadsheets, and interactivity to databases, an integration we term presentational data management (PDM). In this paper, we make a first step towards this vision: developing a storage engine for PDM, studying how to flexibly represent spreadsheet data within a database and how to support and maintain access by position. We first conduct an extensive survey of spreadsheet use to motivate our functional requirements for a storage engine for PDM. We develop a natural set of mechanisms for flexibly representing spreadsheet data and demonstrate that identifying the optimal representation is NP-Hard; however, we develop an efficient approach to identify the optimal representation from an important and intuitive subclass of representations. We extend our mechanisms with positional access mechanisms that don't suffer from cascading update issues, leading to constant time access and modification performance. We evaluate these representations on a workload of typical spreadsheets and spreadsheet operations, providing up to 20% reduction in storage, and up to 50% reduction in formula evaluation time

    Database system architecture supporting coexisting query languages and data models

    Get PDF
    SIGLELD:D48239/84 / BLDSC - British Library Document Supply CentreGBUnited Kingdo

    Provenance in Collaborative Data Sharing

    Get PDF
    This dissertation focuses on recording, maintaining and exploiting provenance information in Collaborative Data Sharing Systems (CDSS). These are systems that support data sharing across loosely-coupled, heterogeneous collections of relational databases related by declarative schema mappings. A fundamental challenge in a CDSS is to support the capability of update exchange --- which publishes a participant\u27s updates and then translates others\u27 updates to the participant\u27s local schema and imports them --- while tolerating disagreement between them and recording the provenance of exchanged data, i.e., information about the sources and mappings involved in their propagation. This provenance information can be useful during update exchange, e.g., to evaluate provenance-based trust policies. It can also be exploited after update exchange, to answer a variety of user queries, about the quality, uncertainty or authority of the data, for applications such as trust assessment, ranking for keyword search over databases, or query answering in probabilistic databases. To address these challenges, in this dissertation we develop a novel model of provenance graphs that is informative enough to satisfy the needs of CDSS users and captures the semantics of query answering on various forms of annotated relations. We extend techniques from data integration, data exchange, incremental view maintenance and view update to define the formal semantics of unidirectional and bidirectional update exchange. We develop algorithms to perform update exchange incrementally while maintaining provenance information. We present strategies for implementing our techniques over an RDBMS and experimentally demonstrate their viability in the Orchestra prototype system. We define ProQL, a query language for provenance graphs that can be used by CDSS users to combine data querying with provenance testing as well as to compute annotations for their data, based on their provenance, that are useful for a variety of applications. Finally, we develop a prototype implementation ProQL over an RDBMS and indexing techniques to speed up provenance querying, evaluate experimentally the performance of provenance querying and the benefits of our indexing techniques

    The XFM view adaptation mechanism: An essential component for XML data warehouses

    Get PDF
    In the past few years, with many organisations providing web services for business and communication purposes, large volumes of XML transactions take place on a daily basis. In many cases, organisations maintain these transactions in their native XML format due to its flexibility for xchanging data between heterogeneous systems. This XML data provides an important resource for decision support systems. As a consequence, XML technology has slowly been included within decision support systems of data warehouse systems. The problem encountered is that existing native XML database systems suffer from poor performance in terms of managing data volume and response time for complex analytical queries. Although materialised XML views can be used to improve the performance for XML data warehouses, update problems then become the bottleneck of using materialised views. Specifically, synchronising materialised views in the face of changing view definitions, remains a significant issue. In this dissertation, we provide a method for XML-based data warehouses to manage updates caused by the change of view definitions (view redefinitions), which is referred to as the view adaptation problem. In our approach, views are defined using XPath and then modelled using a set of novel algebraic operators and fragments. XPath views are integrated into a single view graph called the XML Fragment Materialisation (XFM) View Graph, where common parts between different views are shared and appear only once in the graph. Fragments within the view graph can be selected for materialisation to facilitate the view adaptation process. While changes are applied, our view adaptation algorithms can quickly determine what part of the XFM view graph is affected. The adaptation algorithms then perform a structural adaptation to update the view graph, followed by data adaptation to update materialised fragments

    Scalable Automated Incrementalization for Real-Time Static Analyses

    Get PDF
    This thesis proposes a framework for easy development of static analyses, whose results are incrementalized to provide instantaneous feedback in an integrated development environment (IDE). Today, IDEs feature many tools that have static analyses as their foundation to assess software quality and catch correctness problems. Yet, these tools often fail to provide instantaneous feedback and are thus restricted to nightly build processes. This precludes developers from fixing issues at their inception time, i.e., when the problem and the developed solution are both still fresh in mind. In order to provide instantaneous feedback, incrementalization is a well-known technique that utilizes the fact that developers make only small changes to the code and, hence, analysis results can be re-computed fast based on these changes. Yet, incrementalization requires carefully crafted static analyses. Thus, a manual approach to incrementalization is unattractive. Automated incrementalization can alleviate these problems and allows analyses writers to formulate their analyses as queries with the full data set in mind, without worrying over the semantics of incremental changes. Existing approaches to automated incrementalization utilize standard technologies, such as deductive databases, that provide declarative query languages, yet also require to materialize the full dataset in main-memory, i.e., the memory is permanently blocked by the data required for the analyses. Other standard technologies such as relational databases offer better scalability due to persistence, yet require large transaction times for data. Both technologies are not a perfect match for integrating static analyses into an IDE, since the underlying data, i.e., the code base, is already persisted and managed by the IDE. Hence, transitioning the data into a database is redundant work. In this thesis a novel approach is proposed that provides a declarative query language and automated incrementalization, yet retains in memory only a necessary minimum of data, i.e., only the data that is required for the incrementalization. The approach allows to declare static analyses as incrementally maintained views, where the underlying formalism for incrementalization is the relational algebra with extensions for object-orientation and recursion. The algebra allows to deduce which data is the necessary minimum for incremental maintenance and indeed shows that many views are self-maintainable, i.e., do not require to materialize memory at all. In addition an optimization for the algebra is proposed that allows to widen the range of self-maintainable views, based on domain knowledge of the underlying data. The optimization works similar to declaring primary keys for databases, i.e., the optimization is declared on the schema of the data, and defines which data is incrementally maintained in the same scope. The scope makes all analyses (views) that correlate only data within the boundaries of the scope self-maintainable. The approach is implemented as an embedded domain specific language in a general-purpose programming language. The implementation can be understood as a database-like engine with an SQL-style query language and the execution semantics of the relational algebra. As such the system is a general purpose database-like query engine and can be used to incrementalize other domains than static analyses. To evaluate the approach a large variety of static analyses were sampled from real-world tools and formulated as incrementally maintained views in the implemented engine
    corecore