12,946 research outputs found

    GraphX: Unifying Data-Parallel and Graph-Parallel Analytics

    Full text link
    From social networks to language modeling, the growing scale and importance of graph data has driven the development of numerous new graph-parallel systems (e.g., Pregel, GraphLab). By restricting the computation that can be expressed and introducing new techniques to partition and distribute the graph, these systems can efficiently execute iterative graph algorithms orders of magnitude faster than more general data-parallel systems. However, the same restrictions that enable the performance gains also make it difficult to express many of the important stages in a typical graph-analytics pipeline: constructing the graph, modifying its structure, or expressing computation that spans multiple graphs. As a consequence, existing graph analytics pipelines compose graph-parallel and data-parallel systems using external storage systems, leading to extensive data movement and complicated programming model. To address these challenges we introduce GraphX, a distributed graph computation framework that unifies graph-parallel and data-parallel computation. GraphX provides a small, core set of graph-parallel operators expressive enough to implement the Pregel and PowerGraph abstractions, yet simple enough to be cast in relational algebra. GraphX uses a collection of query optimization techniques such as automatic join rewrites to efficiently implement these graph-parallel operators. We evaluate GraphX on real-world graphs and workloads and demonstrate that GraphX achieves comparable performance as specialized graph computation systems, while outperforming them in end-to-end graph pipelines. Moreover, GraphX achieves a balance between expressiveness, performance, and ease of use

    A unified view of data-intensive flows in business intelligence systems : a survey

    Get PDF
    Data-intensive flows are central processes in today’s business intelligence (BI) systems, deploying different technologies to deliver data, from a multitude of data sources, in user-preferred and analysis-ready formats. To meet complex requirements of next generation BI systems, we often need an effective combination of the traditionally batched extract-transform-load (ETL) processes that populate a data warehouse (DW) from integrated data sources, and more real-time and operational data flows that integrate source data at runtime. Both academia and industry thus must have a clear understanding of the foundations of data-intensive flows and the challenges of moving towards next generation BI environments. In this paper we present a survey of today’s research on data-intensive flows and the related fundamental fields of database theory. The study is based on a proposed set of dimensions describing the important challenges of data-intensive flows in the next generation BI setting. As a result of this survey, we envision an architecture of a system for managing the lifecycle of data-intensive flows. The results further provide a comprehensive understanding of data-intensive flows, recognizing challenges that still are to be addressed, and how the current solutions can be applied for addressing these challenges.Peer ReviewedPostprint (author's final draft

    AMaχoS—Abstract Machine for Xcerpt

    Get PDF
    Web query languages promise convenient and efficient access to Web data such as XML, RDF, or Topic Maps. Xcerpt is one such Web query language with strong emphasis on novel high-level constructs for effective and convenient query authoring, particularly tailored to versatile access to data in different Web formats such as XML or RDF. However, so far it lacks an efficient implementation to supplement the convenient language features. AMaχoS is an abstract machine implementation for Xcerpt that aims at efficiency and ease of deployment. It strictly separates compilation and execution of queries: Queries are compiled once to abstract machine code that consists in (1) a code segment with instructions for evaluating each rule and (2) a hint segment that provides the abstract machine with optimization hints derived by the query compilation. This article summarizes the motivation and principles behind AMaχoS and discusses how its current architecture realizes these principles

    Adaptive Energy-aware Scheduling of Dynamic Event Analytics across Edge and Cloud Resources

    Full text link
    The growing deployment of sensors as part of Internet of Things (IoT) is generating thousands of event streams. Complex Event Processing (CEP) queries offer a useful paradigm for rapid decision-making over such data sources. While often centralized in the Cloud, the deployment of capable edge devices on the field motivates the need for cooperative event analytics that span Edge and Cloud computing. Here, we identify a novel problem of query placement on edge and Cloud resources for dynamically arriving and departing analytic dataflows. We define this as an optimization problem to minimize the total makespan for all event analytics, while meeting energy and compute constraints of the resources. We propose 4 adaptive heuristics and 3 rebalancing strategies for such dynamic dataflows, and validate them using detailed simulations for 100 - 1000 edge devices and VMs. The results show that our heuristics offer O(seconds) planning time, give a valid and high quality solution in all cases, and reduce the number of query migrations. Furthermore, rebalance strategies when applied in these heuristics have significantly reduced the makespan by around 20 - 25%.Comment: 11 pages, 7 figure
    corecore