8 research outputs found

    Demystifying Graph Databases: Analysis and Taxonomy of Data Organization, System Designs, and Graph Queries

    Graph processing has become an important part of multiple areas of computer science, such as machine learning, computational sciences, medical applications, social network analysis, and many others. Numerous graphs such as web or social networks may contain up to trillions of edges. Often, these graphs are also dynamic (their structure changes over time) and have domain-specific rich data associated with vertices and edges. Graph database systems such as Neo4j enable storing, processing, and analyzing such large, evolving, and rich datasets. Due to the sheer size of such datasets, combined with the irregular nature of graph processing, these systems face unique design challenges. To facilitate the understanding of this emerging domain, we present the first survey and taxonomy of graph database systems. We focus on identifying and analyzing fundamental categories of these systems (e.g., triple stores, tuple stores, native graph database systems, or object-oriented systems), the associated graph models (e.g., RDF or Labeled Property Graph), data organization techniques (e.g., storing graph data in indexing structures or dividing data into records), and different aspects of data distribution and query execution (e.g., support for sharding and ACID). 51 graph database systems are presented and compared, including Neo4j, OrientDB, and Virtuoso. We also outline graph database queries and the relationships with associated domains (NoSQL stores, graph streaming, and dynamic graph algorithms). Finally, we describe research and engineering challenges to outline the future of graph databases.
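The Labeled Property Graph (LPG) model named in the taxonomy above can be illustrated with a minimal sketch: vertices and edges each carry a set of labels and arbitrary key-value properties. The class and field names below are illustrative, not part of any surveyed system.

```python
from dataclasses import dataclass, field

# Minimal sketch of the Labeled Property Graph (LPG) model: both vertices
# and edges carry sets of labels and arbitrary key-value properties.
@dataclass
class Vertex:
    vid: int
    labels: set = field(default_factory=set)
    properties: dict = field(default_factory=dict)

@dataclass
class Edge:
    src: int           # vid of the source vertex
    dst: int           # vid of the destination vertex
    label: str         # edge type, e.g. "KNOWS"
    properties: dict = field(default_factory=dict)

# A two-person social graph in the LPG model.
alice = Vertex(1, {"Person"}, {"name": "Alice"})
bob = Vertex(2, {"Person"}, {"name": "Bob"})
knows = Edge(1, 2, "KNOWS", {"since": 2020})
```

Triple stores would instead decompose the same data into (subject, predicate, object) facts; the LPG model keeps labels and properties attached to the graph elements themselves.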

    View and Index Selection on Graph Databases

    One of the most important aspects of native graph database systems is the index-free adjacency property: every node of the graph has a direct physical RAM address and physically points to its adjacent nodes. Index-free adjacency accelerates query answering for queries that are bound to one (or more) specific nodes within the graph, the anchor nodes. The anchor node is used as the starting point for answering the query, so only its adjacent nodes are examined instead of the whole graph. Nevertheless, queries without anchor nodes are much harder to answer, since the query planner must examine a large portion of the graph. In this work we study view and index selection techniques to accelerate this class of queries. We analyze different view and index selection strategies for query answering and show that, depending on the characteristics of the query, the graph database, and the corresponding answer set, a different strategy may be optimal among the indexing and view materialization alternatives.
    Before selecting views and indices, our system employs pattern mining techniques to predict the characteristics of future queries. The initial query workload is thus represented by a much smaller summary of the query patterns most likely to appear in future queries, each pattern annotated with its expected number of occurrences. Our selection strategy is a greedy view and index selection strategy that at each step of its execution tries to maximize the ratio of the benefit of materializing a view/index to the corresponding cost of storing it. Our selection algorithm is inspired by the corresponding greedy algorithm for "Maximizing a Nondecreasing Submodular Set Function Subject to a Knapsack Constraint". Our experimental evaluation shows that all steps of the index selection process complete in a few seconds, while the corresponding rewritings accelerate 15.44% of the queries in the DBpedia query workload. Those queries execute in 1.63% of their initial time on average.
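The benefit/cost greedy loop described above can be sketched as follows. This is a simplified illustration, not the thesis's implementation: `benefit` and `cost` are stand-ins for the system's real estimators, and the full knapsack-constrained algorithm from the cited paper additionally compares the greedy solution against the best single candidate.

```python
def greedy_select(candidates, benefit, cost, budget):
    """Greedily pick views/indices, maximizing marginal benefit per
    unit of storage cost, subject to a total storage budget."""
    selected, used = [], 0
    remaining = set(candidates)
    while True:
        # Only candidates that still fit in the storage budget.
        feasible = [c for c in remaining if used + cost(c) <= budget]
        if not feasible:
            break
        # Marginal benefit of adding c, divided by its storage cost.
        def ratio(c):
            return (benefit(selected + [c]) - benefit(selected)) / cost(c)
        best = max(feasible, key=ratio)
        if ratio(best) <= 0:
            break
        selected.append(best)
        used += cost(best)
        remaining.remove(best)
    return selected

# Toy workload: each candidate's benefit is the number of query-pattern
# occurrences it accelerates (hypothetical figures, additive for simplicity).
gains = {"idx_a": 10, "idx_b": 6, "view_ab": 14}
sizes = {"idx_a": 4, "idx_b": 2, "view_ab": 7}
picked = greedy_select(
    gains,
    benefit=lambda s: sum(gains[c] for c in s),
    cost=lambda c: sizes[c],
    budget=8,
)
# picked == ["idx_b", "idx_a"]: idx_b has the best ratio (3.0), then idx_a
# (2.5) still fits, while view_ab (ratio 2.0) would exceed the budget.
```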

    Neural Graph Databases

    Graph databases (GDBs) enable processing and analysis of unstructured, complex, rich, and usually vast graph datasets. Despite the large significance of GDBs in both academia and industry, little effort has been made to integrate them with the predictive power of graph neural networks (GNNs). In this work, we show how to seamlessly combine nearly any GNN model with the computational capabilities of GDBs. For this, we observe that the majority of these systems are based on, or support, a graph data model called the Labeled Property Graph (LPG), where vertices and edges can have arbitrarily complex sets of labels and properties. We then develop LPG2vec, an encoder that transforms an arbitrary LPG dataset into a representation that can be directly used with a broad class of GNNs, including convolutional, attentional, message-passing, and even higher-order or spectral models. In our evaluation, we show that the rich information represented as LPG labels and properties is properly preserved by LPG2vec, and that it increases the accuracy of predictions regardless of the targeted learning task or the used GNN model, by up to 34% compared to graphs with no LPG labels/properties. In general, LPG2vec enables combining the predictive power of the most powerful GNNs with the full scope of information encoded in the LPG model, paving the way for neural graph databases, a class of systems where the vast complexity of maintained data will benefit from modern and future graph machine learning methods.
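The core idea of turning LPG labels and properties into GNN-consumable features can be sketched minimally: one-hot encode the labels and append numeric property values into a single feature vector per vertex. This is only the basic intuition under assumed inputs; the actual LPG2vec encoder is considerably more elaborate.

```python
def encode_vertex(labels, properties, label_vocab, prop_keys):
    """Sketch of label/property encoding: one-hot over a fixed label
    vocabulary, followed by numeric property values in a fixed key order.
    Missing properties default to 0.0 so all vectors share one length."""
    vec = [1.0 if lbl in labels else 0.0 for lbl in label_vocab]
    vec += [float(properties.get(k, 0.0)) for k in prop_keys]
    return vec

# Hypothetical vocabulary and property schema for a two-label graph.
x = encode_vertex({"Person"}, {"age": 42}, ["Person", "Company"], ["age"])
# x == [1.0, 0.0, 42.0]
```

A GNN framework would stack such vectors into the node feature matrix; the claim of the paper is that preserving this label/property signal, rather than discarding it, is what improves downstream accuracy.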

    Bench-Ranking: A Prescriptive Analysis Method for Querying Large Knowledge Graphs

    Leveraging relational Big Data (BD) processing frameworks to process large knowledge graphs yields great interest in optimizing query performance. Modern BD systems are complicated data systems whose configurations notably affect performance. Benchmarking different frameworks and configurations provides the community with best practices for better performance. However, most of these benchmarking efforts are limited to descriptive and diagnostic analytics, and there is no standard for comparing these benchmarks based on quantitative ranking techniques. Moreover, designing mature pipelines for processing big graphs entails additional design decisions that emerge with the non-native (relational) graph processing paradigm. Those design decisions cannot be made automatically, e.g., the choice of relational schema, partitioning technique, and storage formats. In this thesis, we discuss how our work fills this timely research gap. We first show the impact of those design decisions' trade-offs on the replicability of BD systems' performance when querying large knowledge graphs. We also show the limitations of descriptive and diagnostic analyses of BD frameworks' performance for querying large graphs. We then investigate how to enable prescriptive analytics via ranking functions and multi-dimensional optimization techniques (called "Bench-Ranking"). This approach abstracts away the complexity of descriptive performance analysis, guiding the practitioner directly to actionable, informed decisions. https://www.ester.ee/record=b553332
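One simple way to turn per-dimension benchmark measurements into a prescriptive ordering is rank aggregation: rank the configurations within each dimension, then sum the ranks. The function below is a generic illustration of that idea under assumed inputs, not the thesis's actual Bench-Ranking functions.

```python
def rank_aggregate(scores):
    """Aggregate per-dimension rankings of configurations into one ordering.
    `scores` maps each dimension name to {config: measured value}, where
    lower measured values are better; lower aggregate rank wins."""
    configs = next(iter(scores.values())).keys()
    totals = {c: 0 for c in configs}
    for dim_scores in scores.values():
        # Rank configurations within this dimension (0 = best).
        for rank, c in enumerate(sorted(dim_scores, key=dim_scores.get)):
            totals[c] += rank
    return sorted(totals, key=totals.get)

# Hypothetical measurements for three pipeline configurations across
# two dimensions (query runtime in seconds, storage in GB).
order = rank_aggregate({
    "runtime": {"A": 1.2, "B": 2.5, "C": 3.1},
    "storage": {"A": 10, "B": 30, "C": 20},
})
# order[0] == "A": best in both dimensions, so best overall.
```

A real multi-dimensional treatment would also handle ties, weights per dimension, and Pareto-optimal fronts, which simple rank summation glosses over.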

    The LDBC Financial Benchmark

    The Linked Data Benchmark Council's Financial Benchmark (LDBC FinBench) is a new effort that defines a graph database benchmark targeting financial scenarios such as anti-fraud and risk control. The benchmark currently has one workload, the Transaction Workload. It captures an OLTP scenario with complex read queries, simple read queries, and write queries that continuously insert or delete data in the graph. Compared to the LDBC SNB, the LDBC FinBench differs in application scenarios, data patterns, and query patterns. This document contains a detailed explanation of the data used in the LDBC FinBench, the definition of the Transaction Workload, a detailed description of all queries, and instructions on how to use the benchmark suite. Comment: For the source code of this specification, see the ldbc_finbench_docs repository on GitHub.

    ETSI SmartM2M Technical Report 103715; Study for oneM2M; Discovery and Query solutions analysis & selection

    The oneM2M system has implemented basic native discovery capabilities. In order to enhance the semantic capabilities of the oneM2M architecture by providing solid contributions to the oneM2M standards, four Technical Reports have been developed, each the outcome of a dedicated study phase: requirements, study, simulation, and standardization. The present document covers the second phase and provides the basis for the other documents. It identifies, defines, and analyses relevant approaches with respect to the use cases and requirements developed in ETSI TR 103 714; the most appropriate one will be selected.

    Enabling Graph Analysis Over Relational Databases

    Complex interactions and systems can be modeled by analyzing the connections between underlying entities or objects described by a dataset. These relationships form networks (graphs), the analysis of which has been shown to provide tremendous value in areas ranging from retail to many scientific domains. This value is obtained by using various methodologies from network science, a field that focuses on studying network representations in the real world. In particular, "graph algorithms", which iteratively traverse a graph's connections, are often leveraged to gain insights. To take advantage of the opportunity presented by graph algorithms, a variety of specialized graph data management systems and analysis frameworks have been proposed in recent years, which have made significant advances in efficiently storing and analyzing graph-structured data. Most datasets, however, currently do not reside in these specialized systems but rather in general-purpose relational database management systems (RDBMS). A relational or similarly structured system is typically governed by a schema of varying strictness that implements constraints and is meticulously designed for the specific enterprise. Such structured datasets contain many relationships between the entities therein that can be seen as latent or "hidden" graphs existing inherently inside the datasets. However, these relationships can typically only be traversed by conducting expensive JOINs using SQL or similar languages. Thus, in order for users to efficiently traverse these latent graphs to conduct analysis, data needs to be transformed and migrated to specialized systems. This creates barriers that hinder and discourage graph analysis; our vision is to break these barriers. In this dissertation we investigate the opportunities and challenges involved in efficiently leveraging relationships within data stored in structured databases.
First, we present GraphGen, a lightweight software layer that is independent of the underlying database and provides interfaces for graph analysis of data in RDBMSs. GraphGen is the first such system that introduces an intuitive high-level language for specifying graphs of interest, and it utilizes in-memory graph representations to tackle the problems associated with analyzing graphs that are hidden inside structured datasets. We show that GraphGen can analyze such graphs using orders of magnitude less memory, and often less computation time, while eliminating manual Extract-Transform-Load (ETL) effort. Second, we examine how in-memory graph representations of RDBMS data can be used to enhance relational query processing. We present a novel, general framework for executing GROUP BY aggregation over conjunctive queries which avoids materialization of intermediate JOIN results, and we wrap this framework inside a multi-way relational operator called Join-Agg. We show that Join-Agg can compute aggregates over a class of relational and graph queries using orders of magnitude less memory and computation time.
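The "latent graph" idea can be made concrete with a toy example. GraphGen itself exposes a high-level graph-specification language over the RDBMS; the sketch below only illustrates, in plain Python over hypothetical rows, the kind of graph such a specification would extract: a co-authorship graph hidden inside an author-writes-paper relation, which in SQL would require a self-JOIN on the paper key.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical rows of a Writes(author, paper) relation.
writes = [("alice", "p1"), ("bob", "p1"), ("bob", "p2"), ("carol", "p2")]

# Group authors by paper (the join key of the latent graph).
by_paper = defaultdict(list)
for author, paper in writes:
    by_paper[paper].append(author)

# Two authors are connected iff they share a paper; this edge set is the
# latent co-authorship graph that a relational self-JOIN would compute.
coauthors = set()
for authors in by_paper.values():
    for a, b in combinations(sorted(authors), 2):
        coauthors.add((a, b))
# coauthors == {("alice", "bob"), ("bob", "carol")}
```

Keeping such extracted graphs as in-memory representations, rather than repeatedly re-running the JOIN, is what lets a layer like GraphGen serve graph algorithms directly over RDBMS-resident data.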

    Measurement of service innovation project success: A practical tool and theoretical implications
