Optimization of Retrieval Algorithms on Large Scale Knowledge Graphs
Knowledge graphs have been shown to play an important role in recent
knowledge mining and discovery, for example in the field of life sciences or
bioinformatics. Although a lot of research has been done in the fields of query
optimization, query transformation, and the storage and retrieval of
large-scale knowledge graphs, algorithmic optimization remains a
major challenge and a vital factor in using graph databases. Few researchers
have addressed the problem of optimizing algorithms on large scale labeled
property graphs. Here, we present two optimization approaches and compare them
with a naive approach of directly querying the graph database. The aim of our
work is to determine limiting factors of graph databases like Neo4j and we
describe a novel solution to tackle these challenges. For this, we suggest a
classification schema to distinguish between levels of problem complexity on a graph
database. We evaluate our optimization approaches on a test system containing a
knowledge graph derived from biomedical publication data enriched with text mining
data. This dense graph has more than 71M nodes and 850M relationships. The
results are very encouraging and, depending on the problem, we were able to
show a speedup by a factor of between 44 and 3839.
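The abstract gives no code, but the gap it measures between directly querying the database per node and an optimized batched retrieval can be mimicked with a small, purely hypothetical in-memory sketch (the graph, function names, and call counters below are illustrative, not the authors' implementation):

```python
# Illustrative sketch only: mimics, in memory, the difference between
# issuing one lookup per node ("naive") and a single batched retrieval.
# The graph and all names here are hypothetical, not from the paper.

graph = {
    "p1": ["p2", "p3"],   # e.g. publication -> cited publications
    "p2": ["p3"],
    "p3": [],
}

def naive_neighbors(nodes):
    """One 'round trip' per node, as a direct per-node query would do."""
    calls = 0
    result = {}
    for n in nodes:
        calls += 1                      # each node costs a separate query
        result[n] = graph.get(n, [])
    return result, calls

def batched_neighbors(nodes):
    """A single 'round trip' fetching all requested neighbor lists at once."""
    wanted = set(nodes)
    result = {n: nbrs for n, nbrs in graph.items() if n in wanted}
    return result, 1                    # one query regardless of batch size

res_a, calls_a = naive_neighbors(["p1", "p2", "p3"])
res_b, calls_b = batched_neighbors(["p1", "p2", "p3"])
assert res_a == res_b and calls_b < calls_a
```

Against a real database such as Neo4j, each "call" is a network round trip plus query planning, which is where batching and precomputation pay off.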
Processing Structured Data Streams
A large amount of data is generated daily from different sources such as social networks, recommendation systems or geolocation systems. Moreover, this information tends to grow exponentially every year. Companies have discovered that processing these data may be important in order to obtain useful conclusions that serve decision-making or the detection and resolution of problems in a more efficient way, for instance, through the study of trends, habits or customs of the population. The information provided by these sources typically consists of a non-structured and continuous data flow, where the relations among data elements form graph structures. Inevitably, the processing performance of this information progressively decreases as the size of the data increases. For this reason, non-structured information is usually handled taking into account only the most recent data and discarding the rest, since older data are considered not relevant when drawing conclusions. However, this approach is not enough in the case of sources that provide graph-structured data, since it is necessary to consider spatial features as well as temporal features. These spatial features refer to the relationships among the data elements.
For example, some cases where it is important to consider spatial aspects are marketing techniques, which require information on the location of users and their possible needs, or the detection of diseases, which uses data about genetic relationships among subjects or the geographic scope.
It is worth highlighting three main contributions of this dissertation. First, we provide a comparative study of seven of the most common processing platforms for working with huge graphs and of the languages used to query them. This study measures the performance of the queries in terms of execution time, and the syntactic complexity of the languages according to three parameters: number of characters, number of operators and number of internal variables. We elaborate this study in order to choose the most suitable technology for developing our proposal.
Second, we propose three methods to reduce the set of data to be processed by a query when working with large graphs, namely spatial, temporal and random approximations. These methods are based on Approximate Query Processing techniques and consist of discarding the information that is considered not relevant for the query. The reduction of the data is performed online with the processing and considers both spatial and temporal aspects of the data. Since discarding information in the source data may decrease the validity of the results, we also define the transformation error obtained with these methods in terms of accuracy, precision and recall.
Finally, we present a preprocessing algorithm, called the SDR algorithm, which is also used to reduce the set of data to be processed, but without compromising the accuracy of the results. It calculates a subgraph of the source graph that contains only the information relevant to a given query. Since this technique is a preprocessing algorithm, it is run offline before the actual processing begins. In addition, an incremental version of the algorithm is developed in order to update the subgraph as new information arrives in the system.
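The spatial, temporal and random approximations mentioned in this abstract can be sketched in a few lines; the stream contents, field names, and thresholds below are hypothetical illustrations, not the dissertation's actual operators:

```python
# Hypothetical sketch of approximate query processing on a stream:
# keep only elements inside a time window (temporal), inside a region
# of interest (spatial), or surviving random sampling (random).
import random

events = [{"t": t, "x": t % 5, "payload": t} for t in range(100)]

def temporal_approx(stream, now, window):
    """Discard everything older than `window` time units."""
    return [e for e in stream if now - e["t"] <= window]

def spatial_approx(stream, region):
    """Discard everything outside the spatial region of interest."""
    return [e for e in stream if e["x"] in region]

def random_approx(stream, keep_prob, seed=0):
    """Discard elements uniformly at random, keeping roughly keep_prob of them."""
    rng = random.Random(seed)
    return [e for e in stream if rng.random() < keep_prob]

recent = temporal_approx(events, now=99, window=9)
nearby = spatial_approx(events, region={0, 1})
sample = random_approx(events, keep_prob=0.1)
assert len(recent) == 10
assert all(e["x"] in {0, 1} for e in nearby)
assert len(sample) < len(events)
```

Each filter trades result validity for processing cost, which is why the dissertation quantifies the resulting transformation error in terms of accuracy, precision and recall.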
Improving query performance on dynamic graphs
Querying large models efficiently often imposes high demands on system resources such as memory, processing time, disk access or network latency. The situation becomes more complicated when data are highly interconnected, e.g. in the form of graph structures, and when data sources are heterogeneous, partly coming from dynamic systems and partly stored in databases. These situations are now common in many existing social networking applications and geolocation systems, which require specialized and efficient query algorithms in order to make informed decisions on time. In this paper, we propose an algorithm to improve the memory consumption and time performance of this type of query by reducing the number of elements to be processed, focusing only on the information that is relevant to the query but without compromising the accuracy of its results. To this end, the reduced subset of data is selected depending on the type of query and its constituent filters. Three case studies are used to evaluate the performance of our proposal, obtaining significant speedups in all cases. This work is partially supported by the European Commission (FEDER) and the Spanish Government under projects APOLO (US-1264651), HORATIO (RTI2018-101204-B-C21), EKIPMENT-PLUS (P18-FR-2895) and COSCA (PGC2018-094905B-I00).
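The core idea of reducing the data set according to the query's filters before evaluation can be sketched as follows; the tiny graph and the filter predicate are hypothetical, and this is not the paper's actual algorithm:

```python
# Illustrative sketch (all names hypothetical): before running a query over
# a large graph, keep only the elements that can satisfy the query's
# filters, then evaluate the query on the reduced subset.

nodes = {
    1: {"type": "user", "age": 17},
    2: {"type": "user", "age": 34},
    3: {"type": "post", "likes": 5},
    4: {"type": "user", "age": 41},
}
edges = [(2, 3), (4, 3), (1, 3)]

def reduce_for_query(node_filter):
    """Keep only nodes passing the filter, plus edges between survivors."""
    kept = {n for n, props in nodes.items() if node_filter(props)}
    return kept, [(a, b) for a, b in edges if a in kept and b in kept]

# Query: adult users (non-user nodes are kept, since the filter
# only constrains users).
kept, kept_edges = reduce_for_query(
    lambda p: p["type"] != "user" or p["age"] >= 18
)
assert kept == {2, 3, 4}
assert kept_edges == [(2, 3), (4, 3)]
```

Because only elements that could never contribute to the result are dropped, the query's accuracy is preserved while its working set shrinks.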
gMark: Schema-Driven Generation of Graphs and Queries
Massive graph data sets are pervasive in contemporary application domains.
Hence, graph database systems are becoming increasingly important. In the
experimental study of these systems, it is vital that the research community
has shared solutions for the generation of database instances and query
workloads having predictable and controllable properties. In this paper, we
present the design and engineering principles of gMark, a domain- and query
language-independent graph instance and query workload generator. A core
contribution of gMark is its ability to target and control the diversity of
properties of both the generated instances and the generated workloads coupled
to these instances. Further novelties include support for regular path queries,
a fundamental graph query paradigm, and schema-driven selectivity estimation of
queries, a key feature in controlling workload chokepoints. We illustrate the
flexibility and practical usability of gMark by showcasing the framework's
capabilities in generating high quality graphs and workloads, and its ability
to encode user-defined schemas across a variety of application domains.
Comment: Accepted in November 2016. URL: http://ieeexplore.ieee.org/document/7762945/. In IEEE Transactions on Knowledge and Data Engineering 201
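Regular path queries, the paradigm gMark supports, select node pairs connected by a path whose edge labels match a regular expression. A minimal sketch for the fixed pattern a·b* (hypothetical graph and names, not gMark's generator or syntax):

```python
# Hypothetical sketch of evaluating the regular path query  a . b* :
# nodes reachable via one 'a' edge followed by any number of 'b' edges.
from collections import deque

edges = {  # (source, label) -> list of targets
    (0, "a"): [1],
    (1, "b"): [2],
    (2, "b"): [3],
}

def rpq_a_then_b_star(src):
    """BFS from the 'a'-successors of src, following only 'b' edges."""
    reached, frontier = set(), deque(edges.get((src, "a"), []))
    while frontier:
        n = frontier.popleft()
        if n not in reached:
            reached.add(n)
            frontier.extend(edges.get((n, "b"), []))
    return reached

assert rpq_a_then_b_star(0) == {1, 2, 3}
assert rpq_a_then_b_star(1) == set()
```

A general evaluator would traverse the product of the graph with the regular expression's automaton; controlling how many pairs such queries return is exactly the selectivity estimation problem gMark addresses.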
Graph Pattern Matching in GQL and SQL/PGQ
As graph databases become widespread, JTC1 -- the committee in joint charge
of information technology standards for the International Organization for
Standardization (ISO) and the International Electrotechnical Commission (IEC) --
has approved a project to create GQL, a standard property graph query language.
This complements a project to extend SQL with a new part, SQL/PGQ, which
specifies how to define graph views over an SQL tabular schema, and to run
read-only queries against them.
Both projects have been assigned to the ISO/IEC JTC1 SC32 working group for
Database Languages, WG3, which continues to maintain and enhance SQL as a
whole. This common responsibility helps enforce a policy that the identical
core of both PGQ and GQL is a graph pattern matching sub-language, here termed
GPML.
The WG3 design process is also analyzed by an academic working group, part of
the Linked Data Benchmark Council (LDBC), whose task is to produce a formal
semantics of these graph data languages, which complements their standard
specifications.
This paper, written by members of WG3 and LDBC, presents the key elements of
the GPML of SQL/PGQ and GQL in advance of the publication of these new
standards.
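At its core, the graph pattern matching that GPML standardizes means finding bindings of pattern variables to graph nodes such that every pattern edge maps to a graph edge. A toy sketch for a directed-triangle pattern (plain Python, not GPML syntax; graph and names hypothetical):

```python
# Illustrative sketch: match the pattern  (x)->(y)->(z)->(x)  by
# enumerating candidate bindings and keeping those whose edges all exist.
from itertools import permutations

graph_edges = {(1, 2), (2, 3), (3, 1), (3, 4)}
nodes = {1, 2, 3, 4}

def match_triangle():
    """Return all (x, y, z) bindings forming a directed triangle."""
    return [
        (x, y, z)
        for x, y, z in permutations(nodes, 3)
        if {(x, y), (y, z), (z, x)} <= graph_edges
    ]

matches = match_triangle()
assert (1, 2, 3) in matches
assert len(matches) == 3  # one triangle, reported once per rotation
```

Real engines replace this brute-force enumeration with join planning and index lookups, and GPML additionally constrains paths with label expressions and quantifiers, but the binding semantics is the same.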
An In-Depth Analysis on Efficiency and Vulnerabilities on a Cloud-Based Searchable Symmetric Encryption Solution
Searchable Symmetric Encryption (SSE) has emerged as an integral cryptographic approach in a world where digital privacy is essential. The capacity to search through encrypted data whilst maintaining its integrity meets the most important demand for security and confidentiality in a society that is increasingly dependent on cloud-based services and data storage. SSE offers efficient processing of queries over encrypted datasets, allowing entities to comply with data privacy rules while preserving database usability. Our research addresses this need, concentrating on the development and thorough testing of an SSE system based on Curtmola’s architecture and employing the Advanced Encryption Standard (AES) in Cipher Block Chaining (CBC) mode. A primary goal of the research is to conduct a thorough evaluation of the security and performance of the system. In order to assess search performance, a variety of database settings were extensively tested, and the system's security was tested by simulating intricate threat scenarios such as count attacks and leakage abuse. The operational efficiency and cryptographic robustness of the SSE system are critically examined in these evaluations.
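The central SSE idea is that the server holds an index keyed by deterministic search tokens (trapdoors) rather than plaintext keywords. A stdlib-only sketch follows; it uses HMAC-SHA256 as the token-deriving PRF purely for self-containment, and does NOT reproduce the AES-CBC/Curtmola construction evaluated in the paper:

```python
# Sketch of the searchable-symmetric-encryption idea only. Uses stdlib
# HMAC-SHA256 as the PRF deriving search tokens; the system in the text
# uses AES-CBC and Curtmola's construction, which this does not reproduce.
import hmac, hashlib

KEY = b"demo-secret-key"  # hypothetical client-held key

def token(keyword: str) -> bytes:
    """Deterministic trapdoor for a keyword; the server never sees the keyword."""
    return hmac.new(KEY, keyword.encode(), hashlib.sha256).digest()

def build_index(docs):
    """Server-side index: trapdoor -> document ids (contents stay encrypted)."""
    index = {}
    for doc_id, text in docs.items():
        for word in set(text.split()):
            index.setdefault(token(word), []).append(doc_id)
    return index

index = build_index({"d1": "graph database query", "d2": "encrypted query"})
# The client searches by sending token("query"); the server matches bytes only.
assert sorted(index[token("query")]) == ["d1", "d2"]
assert token("graph") in index
```

Note the deterministic tokens are exactly what enables the count and leakage-abuse attacks the paper simulates: the server learns which (opaque) keyword each query repeats and how many documents match it.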
PG-Triggers: Triggers for Property Graphs
Graph databases are emerging as the leading data management technology for
storing large knowledge graphs; significant efforts are ongoing to produce new
standards (such as the Graph Query Language, GQL), as well as enrich them with
properties, types, schemas, and keys. In this article, we propose PG-Triggers,
a complete proposal for adding triggers to Property Graphs, along the direction
marked by the SQL3 Standard. We define the syntax and semantics of PG-Triggers
and then illustrate how they can be implemented on top of Neo4j, one of the
most popular graph databases. In particular, we introduce a syntax-directed
translation from PG-Triggers into Neo4j, which makes use of the so-called APOC
triggers; APOC is a community-contributed library for augmenting the Cypher
query language supported by Neo4j. We also illustrate the use of PG-Triggers
through a life science application inspired by the COVID-19 pandemic. The main
result of this article is proposing reactive aspects within graph databases as
first-class citizens, so as to turn them into an ideal infrastructure for
supporting reactive knowledge management.
Comment: 12 pages, 4 figures, 3 tables
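The reactive semantics the abstract proposes, an action fired automatically after a graph update, can be sketched in plain Python (this is an illustration of the trigger concept, not PG-Triggers or APOC syntax; all class and event names are hypothetical):

```python
# Illustrative sketch of trigger semantics on a property graph: a callback
# registered for 'after_create' fires whenever a node is added, enabling
# reactive updates such as audit logging or derived-data maintenance.

class TinyPropertyGraph:
    def __init__(self):
        self.nodes = []
        self.triggers = {"after_create": []}

    def on(self, event, callback):
        self.triggers[event].append(callback)

    def create_node(self, labels, props):
        node = {"labels": labels, "props": props}
        self.nodes.append(node)
        for cb in self.triggers["after_create"]:
            cb(self, node)          # reactive part: fire registered triggers
        return node

g = TinyPropertyGraph()
audit_log = []

# Trigger: whenever a Patient node is created, record an audit entry.
g.on("after_create", lambda graph, n:
     audit_log.append(n["props"]["name"]) if "Patient" in n["labels"] else None)

g.create_node({"Patient"}, {"name": "alice"})
g.create_node({"Drug"}, {"name": "x123"})
assert audit_log == ["alice"]
```

In the article's setting the trigger body is a Cypher statement and the firing machinery is provided by Neo4j's APOC trigger procedures, but the event/condition/action shape is the same.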