
    Graph Summarization

    The continuous and rapid growth of highly interconnected datasets, which are both voluminous and complex, calls for the development of adequate processing and analytical techniques. One method for condensing and simplifying such datasets is graph summarization. It denotes a series of application-specific algorithms designed to transform graphs into more compact representations while preserving structural patterns, query answers, or specific property distributions. As this problem is common to several areas studying graph topologies, different approaches, such as clustering, compression, sampling, or influence detection, have been proposed, primarily based on statistical and optimization methods. This chapter pinpoints the main graph summarization methods, with particular attention to the most recent approaches and novel research trends not yet covered by previous surveys. Comment: To appear in the Encyclopedia of Big Data Technologies.

    Time and Memory Efficient Parallel Algorithm for Structural Graph Summaries and two Extensions to Incremental Summarization and $k$-Bisimulation for Long $k$-Chaining

    We developed a flexible parallel algorithm for graph summarization based on vertex-centric programming and parameterized message passing. The base algorithm supports infinitely many structural graph summary models defined in a formal language. An extension of the parallel base algorithm allows incremental graph summarization. In this paper, we prove that the incremental algorithm is correct and show that updates are performed in time $\mathcal{O}(\Delta \cdot d^k)$, where $\Delta$ is the number of additions, deletions, and modifications to the input graph, $d$ the maximum degree, and $k$ is the maximum distance in the subgraphs considered. Although the iterative algorithm supports values of $k>1$, it requires nested data structures for the message passing that are memory-inefficient. Thus, we extended the base summarization algorithm by a hash-based messaging mechanism to support a scalable iterative computation of graph summarizations based on $k$-bisimulation for arbitrary $k$. We empirically evaluate the performance of our algorithms using benchmark and real-world datasets. The incremental algorithm almost always outperforms the batch computation. We observe in our experiments that the incremental algorithm is faster even in cases when $50\%$ of the graph database changes from one version to the next. The incremental computation requires a three-layered hash index, which has a low memory overhead of only $8\%$ ($\pm 1\%$). Finally, the incremental summarization algorithm outperforms the batch algorithm even with fewer cores. The iterative parallel $k$-bisimulation algorithm computes summaries on graphs with over $10$M edges within seconds. We show that the algorithm processes graphs of $100+\,$M edges within a few minutes while having a moderate memory consumption of $<150$ GB. For the largest BSBM1B dataset with 1 billion edges, it computes $k=10$ bisimulation in under an hour.
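    The idea of computing $k$-bisimulation iteratively with hashed signatures can be illustrated with a minimal sequential sketch. This is not the authors' parallel, vertex-centric implementation: the function name `k_bisimulation`, the triple-based input format, and the choice of forward (outgoing-edge) bisimulation are assumptions made for illustration only. In each round, a node's signature is hashed into a flat identifier, so no nested structure grows with $k$.

```python
from collections import defaultdict


def k_bisimulation(edges, k):
    """Return {node: block_id} after at most k refinement rounds.

    edges: iterable of (source, label, target) triples (assumed format).
    Two nodes share a block_id if they are forward k-bisimilar,
    up to the (very unlikely) event of a hash collision.
    """
    out = defaultdict(set)
    nodes = set()
    for s, p, o in edges:
        out[s].add((p, o))
        nodes.update((s, o))

    # Round 0: all nodes are in one block.
    block = {n: 0 for n in nodes}
    for _ in range(k):
        # Signature = own block plus the set of (label, neighbour block)
        # pairs, hashed to a flat integer instead of a nested structure.
        signature = {
            n: hash((block[n],
                     frozenset((p, block[o]) for p, o in out.get(n, ()))))
            for n in nodes
        }
        # Refinement only ever splits blocks, so an unchanged block
        # count means the partition has reached a fixpoint.
        if len(set(signature.values())) == len(set(block.values())):
            break
        block = signature
    return block


# Hypothetical toy graph: a and c look alike at depth 1 but not at depth 2.
toy = [("a", "knows", "b"), ("c", "knows", "d"),
       ("b", "name", "x"), ("d", "age", "y")]
blocks1 = k_bisimulation(toy, 1)
blocks2 = k_bisimulation(toy, 2)
```

    In this toy graph, `a` and `c` fall into the same block for $k=1$ because both have a single outgoing `knows` edge, and are separated for $k=2$ because their neighbours differ.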

    Query-Oriented Summarization of RDF Graphs

    The Resource Description Framework (RDF) is the W3C’s graph data model for Semantic Web applications. We study the problem of RDF graph summarization: given an input RDF graph G, find an RDF graph G' which summarizes G as accurately as possible, while being possibly orders of magnitude smaller than the original graph. Our approach is query-oriented, i.e., querying a summary of a graph should reflect whether the query has some answers against this graph. The summaries are intended to help with query formulation and optimization. We introduce two summaries: a baseline which is compact and simple and satisfies certain accuracy and representativeness properties, but may oversimplify the RDF graph, and a refined one which trades some of these properties for more accuracy in representing the structure.

    Doctor of Philosophy

    Linked data are the de facto standard in publishing and sharing data on the web. To date, we have been inundated with large amounts of ever-increasing linked data in constantly evolving structures. The proliferation of the data and the need to access and harvest knowledge from distributed data sources motivate us to revisit several classic problems in query processing and query optimization. The problem of answering queries over views is commonly encountered in a number of settings, including while enforcing security policies to access linked data, or when integrating data from disparate sources. We approach this problem by efficiently rewriting queries over the views to equivalent queries over the underlying linked data, thus avoiding the costs entailed by view materialization and maintenance. An outstanding problem of query rewriting is that the number of rewritten queries is exponential in the size of the query and the views, which motivates us to study the problem of multiquery optimization in the context of linked data. Our solutions are declarative and make no assumptions about the underlying storage, i.e., they are store-independent. Unlike relational and XML data, linked data are schema-less. While tracking the evolution of schema for linked data is hard, keyword search is an ideal tool to perform data integration. Existing works make crippling assumptions about the data and hence fall short in handling massive linked data with tens to hundreds of millions of facts. Our study of keyword search on linked data brings together the classical techniques in the literature and our novel ideas, leading to much better query efficiency and result quality. Linked data also contain rich temporal semantics. To cope with the ever-increasing data, we have investigated how to partition and store large temporal or multiversion linked data for distributed and parallel computation, in an effort to achieve load balancing and support scalable data analytics for massive linked data.

    Linked Data Entity Summarization

    On the Web, the amount of structured and Linked Data about entities is constantly growing. Descriptions of single entities often include thousands of statements and it becomes difficult to comprehend the data, unless a selection of the most relevant facts is provided. This doctoral thesis addresses the problem of Linked Data entity summarization. The contributions involve two entity summarization approaches, a common API for entity summarization, and an approach for entity data fusion.

    Evaluating Knowledge Anchors in Data Graphs against Basic Level Objects

    The growing number of available data graphs in the form of RDF Linked Data enables the development of semantic exploration applications in many domains. Often, the users are not domain experts and are therefore unaware of the complex knowledge structures represented in the data graphs they interact with. This hinders users’ experience and effectiveness. Our research concerns intelligent support to facilitate the exploration of data graphs by users who are not domain experts. We propose a new navigation support approach underpinned by the subsumption theory of meaningful learning, which postulates that new concepts are grasped by starting from familiar concepts which serve as knowledge anchors from where links to new knowledge are made. Our earlier work has developed several metrics and the corresponding algorithms for identifying knowledge anchors in data graphs. In this paper, we assess the performance of these algorithms by considering the user perspective and application context. The paper addresses the challenge of aligning basic level objects that represent familiar concepts in human cognitive structures with automatically derived knowledge anchors in data graphs. We present a systematic approach that adapts experimental methods from Cognitive Science to derive basic level objects underpinned by a data graph. This is used to evaluate knowledge anchors in data graphs in two application domains: semantic browsing (Music) and semantic search (Careers). The evaluation validates the algorithms, which enables their adoption over different domains and application contexts.

    Path Patterns Visualization in Semantic Graphs

    Graphs with a large number of nodes and edges are difficult to visualize. Semantic graphs add to the challenge since their nodes and edges have types and this information must be mirrored in the visualization. A common approach to cope with this difficulty is to omit certain nodes and edges, displaying sub-graphs of smaller size. However, other transformations can be used to abstract semantic graphs and this research explores a particular one, both to reduce the graph's size and to focus on its path patterns. Antigraphs are a novel kind of graph designed to highlight path patterns using this kind of abstraction. They are composed of antinodes connected by antiedges, and these reflect respectively edges and nodes of the semantic graph. The prefix "anti" refers to this inversion of the nature of the main graph constituents. Antigraphs trade the visualization of nodes and edges for the visualization of graph path patterns involving typed edges. Thus, they are targeted at users who require a deep understanding of the semantic graph they represent, in particular of its path patterns, rather than at users wanting to browse the semantic graph's content. Antigraphs help programmers querying the semantic graph or designers of semantic measures interested in using it as a semantic proxy. Hence, antigraphs are not expected to compete with other forms of semantic graph visualization but rather to be used as a complementary tool. This paper provides a precise definition both of antigraphs and of the mapping of semantic graphs into antigraphs. Their visualization is obtained with antigraph diagrams. A web application to visualize and interact with these diagrams was implemented to validate the proposed approach. Diagrams of well-known semantic graphs are also presented and discussed.

    ExpRalytics: Expressive and Efficient Analytics for RDF Graphs

    Large (Linked) Open Data are increasingly shared as RDF graphs today. However, such data does not yet reach its full potential in terms of sharing and reuse. We provide new methods to meaningfully summarize data graphs, with a particular focus on RDF graphs. One class of tools for this task are structural RDF graph summaries, which allow users to grasp the different connections between RDF graph nodes. To this end, we introduce our novel RDFQuotient tool that finds compact yet informative RDF graph summaries that can serve as first-sight visualizations of an RDF graph’s structure. We also consider the problem of automatically identifying the k most interesting aggregate queries that can be evaluated on an RDF graph, given an integer k and a user-specified interestingness function. Aggregate queries are routinely used to learn insights from relational data warehouses, and some prior research has addressed the problem of automatically recommending interesting aggregate queries.
    Open data are often shared as RDF graphs, which are an embodiment of the Linked Open Data principle. Such data, however, have not reached their full potential for use and sharing. The main obstacle lies in users' ability to explore, discover, and grasp the content of RDF graphs; this task is difficult because the graphs are naturally heterogeneous and can be both large and complex. We propose new methods for summarizing large data graphs, with a particular focus on RDF graphs. To this end, we have proposed a new approach for building structural summaries of RDF graphs, namely RDFQuotient. We also consider the problem of automatically identifying the most interesting aggregate queries that can be evaluated on an RDF graph.
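    A structural quotient summary of the general kind described above can be sketched in a few lines. This is not the RDFQuotient algorithm: the function names and the equivalence relation used here (nodes are equivalent when they have the same set of outgoing properties) are assumptions chosen only to make the example concrete. The sketch collapses each equivalence class into one summary node and keeps an edge between two summary nodes whenever some original edge connects their members.

```python
from collections import defaultdict


def by_outgoing_properties(triples):
    """Hypothetical equivalence: group nodes by their set of outgoing properties."""
    props = defaultdict(set)
    for s, p, _ in triples:
        props[s].add(p)
    # Nodes without outgoing edges all map to the empty set.
    return lambda node: frozenset(props.get(node, ()))


def quotient_summary(triples, block_of):
    """Map every (s, p, o) triple to (block(s), p, block(o)) and deduplicate."""
    return {(block_of(s), p, block_of(o)) for s, p, o in triples}


# Hypothetical toy graph with two structurally identical people.
graph = [("a", "name", "x"), ("b", "name", "y"),
         ("a", "knows", "b"), ("b", "knows", "a")]
block_of = by_outgoing_properties(graph)
summary = quotient_summary(graph, block_of)
```

    Here the four input triples collapse into two summary edges. Because every input edge is mapped to a summary edge with the same label, any chain of edge labels that occurs in the input also occurs in the summary.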