    Compacting frequent star patterns in RDF graphs

    Knowledge graphs have become a popular formalism for representing entities and their properties using a graph data model, e.g., the Resource Description Framework (RDF). An RDF graph comprises entities of the same type connected to objects or other entities using labeled edges annotated with properties. RDF graphs usually contain entities that share the same objects in a certain group of properties, i.e., they match star patterns composed of these properties and objects. When the number of entities or properties in these star patterns is large, the size of the RDF graph and query processing are negatively impacted; we refer to these star patterns as frequent star patterns. We address the problem of identifying frequent star patterns in RDF graphs and devise the concept of factorized RDF graphs, which denote compact representations of RDF graphs in which the number of frequent star patterns is minimized. We also develop computational methods to identify frequent star patterns and generate a factorized RDF graph, where compact RDF molecules replace frequent star patterns. A compact RDF molecule of a frequent star pattern denotes an RDF subgraph that instantiates the corresponding star pattern. Instead of having all the entities match the original frequent star pattern, a surrogate entity is added and related to the properties of the frequent star pattern; it is linked to the entities that originally match the pattern. Since the edges between the entities and the objects in the frequent star pattern are replaced by edges between these entities and the surrogate entity of the compact RDF molecule, the size of the RDF graph is reduced. We evaluate the performance of our factorization techniques on several RDF graph benchmarks and compare them with a baseline built on top of gSpan, a state-of-the-art algorithm for detecting frequent patterns. The outcomes evidence the efficiency of the proposed approach and show that our techniques reduce the execution time of the baseline approach by at least three orders of magnitude. Additionally, the RDF graph size can be reduced by up to 66.56% while the data represented in the original RDF graph is preserved.
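
    To make the rewiring concrete, here is a minimal sketch in plain Python of the factorization described above, using made-up entities; the membership property ex:partOf and the surrogate naming are our own placeholders, not the paper's vocabulary.

```python
from collections import defaultdict

# Hypothetical RDF graph as (subject, property, object) triples.
triples = {
    ("ex:alice", "ex:livesIn", "ex:Berlin"),
    ("ex:alice", "ex:worksAt", "ex:AcmeCorp"),
    ("ex:bob",   "ex:livesIn", "ex:Berlin"),
    ("ex:bob",   "ex:worksAt", "ex:AcmeCorp"),
    ("ex:carol", "ex:livesIn", "ex:Berlin"),
    ("ex:carol", "ex:worksAt", "ex:AcmeCorp"),
}

def factorize(triples, min_entities=2):
    """Replace frequent star patterns by compact RDF molecules:
    entities sharing the same set of (property, object) pairs are
    linked to one surrogate entity that carries those pairs once."""
    star_of = defaultdict(set)              # subject -> {(p, o), ...}
    for s, p, o in triples:
        star_of[s].add((p, o))

    groups = defaultdict(list)              # star pattern -> subjects
    for s, star in star_of.items():
        groups[frozenset(star)].append(s)

    out, n = set(), 0
    for star, subjects in groups.items():
        if len(subjects) >= min_entities and len(star) > 1:
            surrogate = f"ex:molecule{n}"   # fresh surrogate entity
            n += 1
            out |= {(surrogate, p, o) for p, o in star}
            # ex:partOf is a placeholder membership property.
            out |= {(s, "ex:partOf", surrogate) for s in subjects}
        else:
            out |= {(s, p, o) for s in subjects for p, o in star}
    return out

compact = factorize(triples)
print(len(triples), "->", len(compact))     # 6 -> 5; savings grow with sharing
```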

    Compact semantic representations of observational data

    The Internet of Things (IoT) concept has been widely adopted in several domains to enable devices to interact with each other and perform certain tasks. IoT devices encompass different concepts, e.g., sensors, programs, computers, and actuators. IoT devices observe their surroundings to collect information and communicate with each other in order to perform mutual tasks. These devices continuously generate observational data streams, which become historical data when these observations are stored. Due to the increase in the number of IoT devices, a large amount of streaming and historical observational data is being produced. Moreover, several ontologies, like the Semantic Sensor Network (SSN) ontology, have been proposed for the semantic annotation of observational data, either streams or historical. The Resource Description Framework (RDF) is a widely adopted data model for semantically describing such datasets. Semantic annotation provides a shared understanding for the processing and analysis of observational data. However, adding semantics further increases the data size, especially when the observation values are redundantly sensed by several devices. For example, several sensors can generate observations indicating the same value for relative humidity at a given timestamp and city. This situation can be represented in an RDF graph using four RDF triples per observation, describing the observed phenomenon, the unit of measurement, the timestamp, and the coordinates. The RDF triples of an observation are associated with the same subject. Such observations share the same objects in a certain group of properties, i.e., they match star patterns composed of these properties and objects. When the number of subject entities or properties in these star patterns is large, the size of the RDF graph and query processing are negatively impacted; we refer to these star patterns as frequent star patterns. This thesis addresses the problem of identifying frequent star patterns in RDF graphs and develops computational methods to identify frequent star patterns and generate a factorized RDF graph in which the number of frequent star patterns is minimized. Furthermore, we apply these factorized RDF representations to historical semantic sensor data described using the SSN ontology and present tabular representations of factorized semantic sensor data in order to exploit Big Data frameworks. In addition, this thesis devises a knowledge-driven approach named DESERT that is able to on-Demand factorizE and Semantically Enrich stReam daTa. We evaluate the performance of our proposed techniques on several RDF graph benchmarks.
The outcomes show that our techniques effectively and efficiently detect frequent star patterns and that the RDF graph size can be reduced by up to 66.56% while the data represented in the original RDF graph is preserved. Moreover, the compact representations reduce the number of RDF triples by at least 53.25% in historical observational data and by up to 94.34% in observational data streams. Additionally, query evaluation over historical data reduces query execution time by up to three orders of magnitude. In observational data streams, the size of the data required to answer a query is reduced by 92.53%, reducing the memory space required to answer queries. These results provide evidence that IoT data can be efficiently represented using the proposed compact representations, thus reducing the negative impact that semantic annotations may have on IoT data management.
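
    The tabular representation mentioned above can be pictured roughly as follows; this is a guessed shape under our own assumptions, not the thesis's actual schema: the shared star patterns form a small dimension-like table, while the observation stream shrinks to one reference column per observation.

```python
import csv, io

# Hypothetical factorized sensor data: each molecule holds the star
# (phenomenon, value, unit, timestamp) shared by many observations.
molecules = {
    "m0": ("RelativeHumidity", "85", "percent", "2021-08-28T10:00"),
}
# The stream itself keeps only (observation, sensor, molecule) references.
stream = [
    ("obs1", "sensorA", "m0"),
    ("obs2", "sensorB", "m0"),
]

buf = io.StringIO()
w = csv.writer(buf)
w.writerow(["observation", "sensor", "molecule"])        # stream table
w.writerows(stream)
w.writerow([])
w.writerow(["molecule", "phenomenon", "value", "unit", "timestamp"])
for m, star in molecules.items():                        # shared stars, once
    w.writerow([m, *star])
print(buf.getvalue())
```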

    Formal Concept Analysis for Semantic Compression of Knowledge Graph Versions

    Recent years have witnessed an increase in openly available knowledge graphs online. These graphs are often structured according to the W3C Semantic Web standard RDF. With this availability of information comes the challenge of coping with dataset versions, as information may change over time and thereby deprecate the former knowledge graph. Several solutions have been proposed to deal with data versioning, mainly based on computing data deltas and taking an incremental approach to keep track of the version history. In this article, we describe a novel method that relies on aggregating graph versions to obtain one single complete graph. Our solution semantically compresses similar and common edges together to obtain a final graph smaller than the sum of the distinct versioned ones. Technically, our method takes advantage of FCA to match graph elements together. We also describe how this compressed graph can be queried, using standard methods, without being unzipped.
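
    As a minimal sketch of the aggregation idea (plain Python with toy triples, not the paper's FCA machinery): identical edges occurring in several versions are stored once and annotated with the set of versions in which they hold, so the merged graph is smaller than the sum of its versions yet each version remains queryable without decompression.

```python
from collections import defaultdict

# Hypothetical versions of a small RDF graph, as sets of triples.
versions = {
    1: {("ex:a", "ex:knows", "ex:b"), ("ex:a", "ex:age", "30")},
    2: {("ex:a", "ex:knows", "ex:b"), ("ex:a", "ex:age", "31")},
}

merged = defaultdict(set)                # triple -> versions containing it
for v, triples in versions.items():
    for t in triples:
        merged[t].add(v)

def triples_in(version):
    """Answer a per-version query directly on the merged graph."""
    return {t for t, vs in merged.items() if version in vs}

print(len(merged), "stored vs", sum(map(len, versions.values())), "summed")
print(triples_in(2))
```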

    Effective reorganization and self-indexing of big semantic data

    In this thesis, we analyze the structural redundancy present in RDF graphs and propose a preprocessing technique, RDF-Tr, which groups, reorganizes, and recodes triples, addressing two sources of structural redundancy underlying the nature of the RDF scheme. We integrate RDF-Tr into HDT and k2-triples, reducing the sizes obtained by the original compressors and outperforming the most prominent state-of-the-art techniques. We call the results of applying RDF-Tr in each compressor HDT++ and k2-triples++, respectively. In the field of RDF compression, compact data structures are used to build RDF self-indexes, which provide efficient access to the data without decompressing it. HDT-FoQ is used to publish and consume large RDF data collections. We extend HDT++, calling the result iHDT++, to resolve SPARQL triple patterns, consuming less memory than HDT-FoQ while speeding up the resolution of most queries and improving on the space-time trade-off of the other self-indexes.
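
    One of the redundancies RDF-Tr exploits can be sketched as follows (our simplified reading, not the actual recoding): subjects are grouped by their predicate family, the set of predicates they use, so each family's predicate list can be stored once rather than repeated per triple.

```python
from collections import defaultdict

# Hypothetical triples.
triples = [
    ("s1", "p:name", "Ann"), ("s1", "p:mail", "a@x"),
    ("s2", "p:name", "Bob"), ("s2", "p:mail", "b@x"),
    ("s3", "p:name", "Eve"),
]

preds = defaultdict(set)                  # subject -> predicates it uses
for s, p, _ in triples:
    preds[s].add(p)

families = defaultdict(list)              # predicate family -> subjects
for s, ps in preds.items():
    families[frozenset(ps)].append(s)

for fam, subjects in families.items():
    print(sorted(fam), "->", sorted(subjects))
# ['p:mail', 'p:name'] -> ['s1', 's2']
# ['p:name'] -> ['s3']
```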

    Normalization Techniques For Improving The Performance Of Knowledge Graph Creation Pipelines

    With the rapid growth of data on the web, the demand for discovering information within data and subsequently exploiting knowledge graphs is rising much faster than we might think. Data integration systems can be of great help in meeting this demand in that they offer transformation of data from various sources and with different volumes. To this end, a data integration system takes advantage of mapping rules, specified in a language like RML, to integrate data collected from various data sources into a knowledge graph. However, large data sources may suffer from various data quality issues, redundancy being one of them. In this regard, the Semantic Web community contributes to Knowledge Engineering with techniques to create knowledge graphs efficiently. The thesis reported in this document tackles creating knowledge graphs in the presence of data sources with redundant data, and a novel normalization theory is proposed to solve this problem. This theory covers not only the characteristics of the data sources but also the mapping rules used to integrate the data sources into a knowledge graph. Based on this, three normal forms are proposed, together with an algorithm for transforming mapping rules and data sources into these normal forms. The proposed approach's performance is evaluated in different testbeds composed of real-world and synthetic data. The observed results suggest that the proposed techniques can dramatically reduce the execution time of knowledge graph creation. Therefore, this thesis's normalization theory contributes to the repertoire of tools that facilitate the creation of knowledge graphs at scale.
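
    To picture why normalization pays off, here is a simplified guess at the effect (not the thesis's actual normal forms): before a mapping rule runs, the source is projected onto the attributes the rule uses and deduplicated, so redundant rows never reach the triple-generation step. All names below are illustrative.

```python
# Hypothetical source with rows that are redundant for the rule below.
rows = [
    {"id": "1", "city": "Berlin", "temp": "20"},
    {"id": "2", "city": "Berlin", "temp": "20"},
]

# A toy RML-like rule: the attributes it uses and the triple it emits.
rule_attrs = ("city", "temp")
def apply_rule(city, temp):
    return (f"ex:{city}", "ex:temperature", temp)

# Naive pipeline: one rule application (and duplicate triple) per row.
naive = [apply_rule(r["city"], r["temp"]) for r in rows]

# Normalized pipeline: project onto rule_attrs, deduplicate, then apply.
projected = {tuple(r[a] for a in rule_attrs) for r in rows}
normalized = [apply_rule(*vals) for vals in projected]

print(len(naive), "vs", len(normalized))   # 2 vs 1 rule applications
```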

    Binary RDF for Scalable Publishing, Exchanging and Consumption in the Web of Data

    The current data deluge is flooding the web with large volumes of data represented in RDF, giving rise to the so-called 'Web of Data'. In this thesis we first propose an in-depth study aimed at gaining a global understanding of the real structure of RDF datasets, and then HDT, which addresses the efficient representation of large volumes of RDF data through structures optimized for storage and network transmission. HDT effectively represents an RDF dataset by splitting it into three components: the Header, the Dictionary, and the structure of RDF statements (Triples). We then focus on providing efficient structures for these components, occupying compressed space while allowing direct access to any piece of data.
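
    The Dictionary/Triples split can be illustrated with a toy encoder (a sketch of the principle, not HDT's actual bit-level layout): every distinct term is assigned an integer ID once, and the triple structure is kept as compact ID tuples.

```python
# Toy HDT-style split: a dictionary of terms plus ID-encoded triples.
triples = [
    ("ex:alice", "ex:knows", "ex:bob"),
    ("ex:bob",   "ex:knows", "ex:alice"),
]

dictionary = {}
def term_id(term):
    # Assign each distinct term one integer ID, reusing it afterwards.
    if term not in dictionary:
        dictionary[term] = len(dictionary) + 1
    return dictionary[term]

encoded = [tuple(term_id(t) for t in triple) for triple in triples]
print(dictionary)   # {'ex:alice': 1, 'ex:knows': 2, 'ex:bob': 3}
print(encoded)      # [(1, 2, 3), (3, 2, 1)]
```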

    Graph Processing in Main-Memory Column Stores

    Increasingly, both novel and traditional business applications leverage the advantages of a graph data model, such as its schema flexibility and an explicit representation of relationships between entities. As a consequence, companies are confronted with the challenge of storing, manipulating, and querying terabytes of graph data for enterprise-critical applications. Although these business applications operate on graph-structured data, they still require direct access to the relational data and typically rely on an RDBMS to keep a single source of truth and access. Existing solutions performing graph operations on business-critical data either use a combination of SQL and application logic or employ a graph data management system. For the first approach, relying solely on SQL results in poor execution performance caused by the functional mismatch between typical graph operations and the relational algebra. To make matters worse, graph algorithms expose a tremendous variety in structure and functionality caused by their often domain-specific implementations, and can therefore hardly be integrated into a database management system other than with custom coding. Since the majority of these enterprise-critical applications run exclusively on relational DBMSs, employing a specialized system for storing and processing graph data is typically not sensible. Besides the maintenance overhead of keeping the systems in sync, combining graph and relational operations is hard to realize, as it requires data transfer across system boundaries. A basic ingredient of graph queries and algorithms are traversal operations, which are a fundamental component of any database management system that aims at storing, manipulating, and querying graph data. Well-established graph traversal algorithms are standalone implementations relying on optimized data structures. Integrating graph traversals as an operator into a database management system requires a tight integration into the existing database environment and the development of new components, such as a graph topology-aware optimizer with accompanying graph statistics, graph-specific secondary index structures to speed up traversals, and an accompanying graph query language. In this thesis, we introduce and describe GRAPHITE, a hybrid graph-relational data management system. GRAPHITE is a performance-oriented graph data management system built into an RDBMS, allowing graph data to be processed seamlessly together with relational data in the same system. We propose a columnar storage representation for graph data to leverage the already existing and mature data management and query processing infrastructure of relational database management systems. At the core of GRAPHITE we propose an execution engine based solely on set operations and graph traversals. Our design is driven by the observation that different graph topologies impose different algorithmic requirements on the design of a graph traversal operator. We derive two graph traversal implementations targeting the most common graph topologies and demonstrate how graph-specific statistics can be leveraged to select the optimal physical traversal operator. To accelerate graph traversals, we devise a set of graph-specific, updateable secondary index structures to improve the performance of vertex neighborhood expansion. Finally, we introduce a domain-specific language with an intuitive programming model to extend graph traversals with custom application logic at runtime.
We use the LLVM compiler framework to generate efficient code that tightly integrates the user-specified application logic with our highly optimized built-in graph traversal operators. Our experimental evaluation shows that GRAPHITE can outperform native graph management systems by several orders of magnitude while providing all the features of an RDBMS, such as transaction support, backup and recovery, and security and user management, effectively making it a promising alternative to specialized graph management systems that lack many of these features and require expensive data replication and maintenance processes.
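
    A traversal operator over a columnar edge representation can be sketched as below (an assumed CSR-like layout, not GRAPHITE's actual storage): edges live in a target column sorted by source, an offset array delimits each vertex's neighbor range, and BFS expands frontiers via set operations over those ranges.

```python
# Columnar adjacency (CSR-like): offsets[v]:offsets[v+1] slices the
# target column to give vertex v's outgoing neighbors.
offsets = [0, 2, 3, 4, 4]          # vertices 0..3
targets = [1, 2, 3, 3]             # edges: 0->1, 0->2, 1->3, 2->3

def bfs(start):
    """Breadth-first traversal using set operations on column slices."""
    visited, frontier = {start}, {start}
    while frontier:
        nxt = {t for v in frontier
                 for t in targets[offsets[v]:offsets[v + 1]]}
        frontier = nxt - visited   # keep only newly discovered vertices
        visited |= frontier
    return visited

print(bfs(0))   # {0, 1, 2, 3}
```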

    9th International Workshop "What can FCA do for Artificial Intelligence?" (FCA4AI 2021)

    Formal Concept Analysis (FCA) is a mathematically well-founded theory aimed at classification and knowledge discovery that can be used for many purposes in Artificial Intelligence (AI). The objective of the ninth edition of the FCA4AI workshop (see http://www.fca4ai.hse.ru/) is to investigate several issues, such as: how FCA can support various AI activities (knowledge discovery, knowledge engineering, machine learning, data mining, information retrieval, recommendation...), how FCA can be extended to help AI researchers solve new and complex problems in their domains, and how FCA can play a role in current AI trends such as explainable AI and fairness of algorithms in decision making. The workshop was held in co-location with IJCAI 2021 in Montréal, Canada, on August 28, 2021.

    Implementation of Web Query Languages Reconsidered

    Visions of the next-generation Web, such as the "Semantic Web" or the "Web 2.0", have triggered the emergence of a multitude of data formats. These formats have different characteristics as far as the shape of data is concerned (for example, tree- vs. graph-shaped). They are accompanied by a puzzlingly large number of query languages, each limited to one data format. Thus, a key feature of the Web, namely to make it possible to access anything published by anyone, is compromised. This thesis is devoted to versatile query languages capable of accessing data in a variety of Web formats. The issue is addressed from three angles: language design; a common, yet uniform, semantics; and a common, yet uniform, evaluation. Thus, the thesis is divided into three parts. First, we consider the query language Xcerpt as an example of the advocated class of versatile Web query languages. Using this concrete exemplar allows us to clarify and discuss the vision of versatility in detail. Second, a number of query languages, XPath, XQuery, SPARQL, and Xcerpt, are translated into a common intermediary language, CIQLog. This language has a purely logical semantics, which makes it easily amenable to optimizations. As a side effect, this provides, to the best of our knowledge, the first logical semantics for XQuery and SPARQL. It is a very useful tool for understanding the commonalities and differences of the considered languages. Third, the intermediate logical language is translated into a query algebra, CIQCAG. The core feature of CIQCAG is that it scales from tree- to graph-shaped data and queries without efficiency losses when tree data and queries are considered: it is shown that, in these cases, optimal complexities are achieved. CIQCAG is also shown to evaluate each of the aforementioned query languages with a complexity at least as good as the best previously known evaluation methods. For example, navigational XPath is evaluated with space complexity O(q·d) and time complexity O(q·n), where q is the query size, n the data size, and d the depth of the (tree-shaped) data. CIQCAG is further shown to provide linear-time and linear-space evaluation of tree-shaped queries for a larger class of graph-shaped data than any previously proposed method. This larger class of graph-shaped data, called continuous-image graphs (CIGs for short), is introduced for the first time in this thesis. A (directed) graph is a CIG if its nodes can be totally ordered in such a manner that, for this order, the children of any node form a continuous interval. CIQCAG achieves these properties by employing a novel data structure, called a sequence map, that allows an efficient evaluation of tree-shaped queries, or of tree-shaped cores of graph-shaped queries, on any graph-shaped data. While being ideally suited to trees and CIGs, the data structure gracefully degrades to unrestricted graphs and yields a remarkably efficient evaluation on graph-shaped data that are only a few edges away from being trees or CIGs.
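
    The CIG condition is easy to check operationally; the following small test (our own sketch) verifies that a given total order of the nodes makes every node's children a contiguous interval of positions.

```python
def is_cig_order(children, order):
    """True if, under `order`, each node's children occupy one
    continuous interval of positions (the CIG condition)."""
    pos = {v: i for i, v in enumerate(order)}
    for kids in children.values():
        if kids:
            idx = sorted(pos[c] for c in kids)
            if idx[-1] - idx[0] + 1 != len(idx):
                return False
    return True

# Hypothetical directed graph given as children lists per node.
children = {"r": ["a", "b"], "s": ["b", "c"], "a": [], "b": [], "c": []}
print(is_cig_order(children, ["r", "s", "a", "b", "c"]))   # True
print(is_cig_order(children, ["r", "s", "a", "c", "b"]))   # False
```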