
    Compressed materialised views of semi-structured data

    Query performance issues over semi-structured data have led to the emergence of materialised XML views as a means of restricting the data structure processed by a query. However, preserving the conventional representation of such views remains a significant limiting factor, especially in the context of mobile devices, where processing power, memory and bandwidth are constrained. To explore the concept of a compressed materialised view, we extend our earlier work on structural XML compression into a combination of structural summarisation and data compression techniques. These techniques provide a basis for efficiently handling both structural queries and value-based predicates. We evaluate the effectiveness of the scheme, presenting results and performance measures that show the advantages of using such structures.
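
    The abstract does not give the compression scheme itself, but the structural-summarisation idea it builds on can be sketched: every element reachable by the same root-to-node label path collapses into a single summary node, so structural queries run against the small summary rather than the full document. Below is a minimal Python sketch on an invented document (the paper additionally applies data compression to the values).

```python
# DataGuide-style structural summary: one entry per distinct label path,
# with a count of the elements that collapsed into it. Illustrative only.
import xml.etree.ElementTree as ET
from collections import defaultdict

def structural_summary(xml_text):
    """Map each distinct root-to-node label path to its element count."""
    root = ET.fromstring(xml_text)
    paths = defaultdict(int)

    def walk(elem, prefix):
        path = f"{prefix}/{elem.tag}"
        paths[path] += 1
        for child in elem:
            walk(child, path)

    walk(root, "")
    return dict(paths)

doc = """<library>
  <book><title>A</title><year>1999</year></book>
  <book><title>B</title><year>2003</year></book>
</library>"""

for path, count in structural_summary(doc).items():
    print(path, count)
# /library 1, /library/book 2, /library/book/title 2, /library/book/year 2
```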

    Updating semi-structured data

    The Web has had tremendous success with its support for the rapid and inexpensive exchange of information. A considerable body of exchanged data is in the form of semi-structured data such as the eXtensible Markup Language (XML). XML, an effective standard for representing and exchanging semi-structured data on the Web, is used ubiquitously in almost all areas of information technology. Most researchers in the XML area have concentrated on storing, querying and publishing XML, while few have paid attention to updating it; thus the XML update area is not fully developed. We propose a solution for updating XML as a representation of semi-structured data. XML is updated through an object-relational database (ORDB) to exploit the maturity of the relational engine and the newer object features of OR technology. The engine is used to enforce constraints during the update of the XML, whereas the object features are used to handle XML's hierarchical structure. Updating XML via an ORDB makes it easier to join XML documents in an update, and such joins in turn make it possible to keep non-redundant data across multiple XML documents. This thesis contributes a solution for updating XML documents via an ORDB to advance our understanding of the XML update area. Rules for mapping XML structure and constraints to an ORDB schema are presented, and a mechanism to handle XML cardinality constraints is provided. An XML update language, an extension to XQuery, has been designed, and this language is translated into standard SQL executed on an ORDB. To handle the recursive nature of XML, a recursive function updating XML data is translated into SQL commands equipped with a programming capability. A method is developed to reflect the changes from the ORDB back to the XML documents. A prototype of the solution has been implemented to help validate the approach, and an experimental study has been conducted to evaluate the performance of XML update processing based on the prototype. The experimental results show that updating multiple XML documents storing non-redundant data performs better than updating a single XML document storing redundant data; an ORDB can take advantage of this by caching data to a greater extent than a native XML database. The solution of updating XML documents via an ORDB solves several problems in existing update methods. Firstly, the preservation of XML constraints is handled by the ORDB engine. Secondly, non-redundant data is stored in linked XML documents, so the problems of data inconsistency and low performance caused by data redundancy are avoided. Thirdly, joins of XML documents are converted to joins of tables in SQL. Fourthly, fields or tables involved in regular path expressions can be resolved quickly using the mapping data. Finally, a recursive function is translated into SQL commands equipped with a programming capability.
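
    As a rough illustration of the kind of translation the thesis describes, the hypothetical sketch below rewrites a single-path XML value update into a parameterised SQL UPDATE via a path-to-table mapping. The mapping, schema and update form are invented for illustration; the thesis defines its own mapping rules and XQuery extension.

```python
# Hypothetical mapping from XML element paths to ORDB tables/columns;
# the thesis derives such mappings from rules over the XML structure.
PATH_MAP = {
    "/library/book":       ("book", None),
    "/library/book/title": ("book", "title"),
    "/library/book/year":  ("book", "year"),
}

def translate_update(path, new_value, key_column, key_value):
    """Rewrite 'replace the value at <path>' into a parameterised UPDATE."""
    table, column = PATH_MAP[path]
    if column is None:
        raise ValueError("path maps to a whole table, not a single column")
    sql = f"UPDATE {table} SET {column} = %s WHERE {key_column} = %s"
    return sql, (new_value, key_value)

sql, params = translate_update("/library/book/title", "New Title", "book_id", 42)
print(sql)     # UPDATE book SET title = %s WHERE book_id = %s
print(params)  # ('New Title', 42)
```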

    Conjoint data mining of structured and semi-structured data

    As knowledge management requirements grow, enterprises are becoming increasingly aware of the significance of interlinking business information across structured and semi-structured data sources. This problem has become more pressing with the growing amount of semi-structured data found in XML repositories, web logs, biological databases and similar sources. Effectively creating links between semi-structured and structured data is a challenging and unresolved problem; once such links are established, data mining can be performed in a conjoint manner. This paper investigates a way in which this problem can be tackled. The proposed method is experimentally evaluated using a real-world database, and its effectiveness and potential for discovering collective information are demonstrated.
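
    The abstract leaves the linking method open; as a minimal illustration of what interlinking the two kinds of source enables, the sketch below joins semi-structured web-log entries to relational rows on a shared key so that both can feed a single mining step. The schema, key and data are assumptions, not the paper's method.

```python
# Join XML log entries (semi-structured) to relational rows (structured)
# on a shared customer key. Purely illustrative.
import sqlite3
import xml.etree.ElementTree as ET

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, region TEXT)")
conn.executemany("INSERT INTO customer VALUES (?, ?)", [(1, "EU"), (2, "US")])

weblog = ET.fromstring("""<log>
  <visit customer="1" page="/pricing"/>
  <visit customer="2" page="/docs"/>
</log>""")

# Each linked (page, region) pair becomes one input record for mining.
for visit in weblog.iter("visit"):
    cid = int(visit.get("customer"))
    (region,) = conn.execute(
        "SELECT region FROM customer WHERE id = ?", (cid,)).fetchone()
    print(cid, visit.get("page"), region)
```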

    Semi-automatic matching of semi-structured data updates

    Data matching, also referred to as data linkage or field matching, is a technique used to combine multiple data sources into one data set. It is used for data integration in a number of sectors and industries, from politics and health care to scientific applications. The motivation for this study was the observation of the day-to-day struggles of a large non-governmental organisation (NGO) in managing its membership database. With a membership base of close to 2.4 million, the organisation faces monumental challenges in capturing and processing semi-structured membership updates. Updates arrive from the field in a multitude of formats, often incomplete and unstructured, and expert knowledge is geographically localised. These issues are compounded by an extremely complex organisational hierarchy and a general lack of data validation processes. An online system was proposed for pre-processing input and then matching it against the membership database. Termed the Data Pre-Processing and Matching System (DPPMS), it allows for single or bulk updates. Based on its success with the NGO's membership database, the DPPMS was subsequently used for pre-processing and data matching of semi-structured patient and financial customer data. Using the semi-automated DPPMS rather than a clerical data matching system, true positive matches increased by 21% while false negative matches decreased by 20%. The recall, precision and F-measure values all improved, and the risk of false positives diminished. The DPPMS was unable to match approximately 8% of the provided records, largely due to human error during initial data capture. While the DPPMS greatly diminished the reliance on experts, their role remained pivotal during the final stage of the process.
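
    A sketch of the two stages described above: normalise an incoming semi-structured record, then score it against master records with a string-similarity measure and refer low-confidence cases to an expert. Python's difflib stands in for whatever matcher the DPPMS actually uses; the field names and threshold are assumptions.

```python
# Pre-process then match: the shape of a semi-automatic matching step.
from difflib import SequenceMatcher

def normalise(record):
    """Basic pre-processing: lowercase and trim every field."""
    return {k: v.strip().lower() for k, v in record.items()}

def match_score(update, master, fields=("name", "branch")):
    """Average per-field string similarity in [0, 1]."""
    sims = [SequenceMatcher(None, update.get(f, ""), master.get(f, "")).ratio()
            for f in fields]
    return sum(sims) / len(sims)

update = normalise({"name": "  T. Mbeki ", "branch": "Cape Town Central"})
master = normalise({"name": "T Mbeki", "branch": "cape town central"})
score = match_score(update, master)
print(f"{score:.2f}", "accept" if score >= 0.85 else "refer to expert")
```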

    Machine learning for managing structured and semi-structured data

    As the digitalization of the private, commercial, and public sectors advances rapidly, an increasing amount of data is becoming available. In order to gain insights or knowledge from these enormous amounts of raw data, deep analysis is essential. The immense volume requires highly automated processes with minimal manual interaction. In recent years, machine learning methods have taken on a central role in this task. In addition to the individual data points, their interrelationships often play a decisive role, e.g., whether two patients are related to each other or whether they are treated by the same physician. Hence, relational learning is an important branch of research, which studies how to harness this explicitly available structural information between different data points. Recently, graph neural networks have gained importance; these can be considered an extension of convolutional neural networks from regular grids to general (irregular) graphs. Knowledge graphs play an essential role in representing facts about entities in a machine-readable way. While great efforts are made to store as many facts as possible in these graphs, they often remain incomplete, i.e., true facts are missing. Manual verification and expansion of the graphs is becoming increasingly difficult due to the large volume of data and must therefore be assisted or substituted by automated procedures which predict missing facts. The field of knowledge graph completion can be roughly divided into two categories: Link Prediction and Entity Alignment. In Link Prediction, machine learning models are trained to predict unknown facts between entities based on the known facts. Entity Alignment aims at identifying shared entities between graphs in order to link several such knowledge graphs, based on some provided seed alignment pairs. In this thesis, we present important advances in the field of knowledge graph completion. For Entity Alignment, we show how to reduce the number of required seed alignments while maintaining performance through novel active learning techniques. We also discuss the power of textual features and show that graph-neural-network-based methods have difficulties with noisy alignment data. For Link Prediction, we demonstrate how to improve predictions for entities unknown at training time by exploiting additional metadata on individual statements, often available in modern graphs. Supported by results from a large-scale experimental study, we present an analysis of the effect of individual components of machine learning models, e.g., the interaction function or loss criterion, on the task of link prediction. We also introduce a software library that simplifies the implementation and study of such components and makes them accessible to a wide research community, ranging from relational learning researchers to applied fields such as the life sciences. Finally, we propose a novel metric for evaluating ranking results, as used for both completion tasks. It allows for easier interpretation and comparison, especially in cases with different numbers of ranking candidates, as encountered in the de facto standard evaluation protocols for both tasks.
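
    The proposed metric itself is not given in the abstract. As a sketch of the evaluation setting it targets, the snippet below computes a plain mean rank and a simple size-adjusted variant that divides by the expected rank of a random scorer, (n + 1) / 2, making scores comparable across candidate-set sizes; this particular adjustment is illustrative, not necessarily the thesis's metric.

```python
# Rank-based evaluation for knowledge graph completion: the rank of the
# true entity among n candidates, aggregated over test triples.

def mean_rank(ranks):
    return sum(ranks) / len(ranks)

def adjusted_mean_rank(ranks, num_candidates):
    """Mean rank normalised by its expectation under random scoring."""
    expected = (num_candidates + 1) / 2  # E[rank] for a random scorer
    return mean_rank(ranks) / expected

ranks = [1, 3, 2, 10]                  # rank of the true entity per test triple
print(mean_rank(ranks))                # 4.0 -- hard to compare across settings
print(adjusted_mean_rank(ranks, 100))  # ~0.079 -- well below random (1.0)
```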

    Interactive Tuples Extraction from Semi-Structured Data

    This paper studies, from a machine learning viewpoint, the problem of extracting tuples of a target n-ary relation from tree-structured data such as XML or XHTML documents. Our system can extract, without any post-processing, tuples for all data structures, including nested, rotated and cross tables. The wrapper induction algorithm we propose is based on two main ideas. It is incremental: partial tuples are extracted by increasing length. And it relies on a representation-enrichment procedure: partial tuples of length i are encoded with the knowledge of extracted tuples of length i − 1. The algorithm is embedded in a user-friendly interactive wrapper induction system for Web documents. We evaluate our system on several information extraction tasks over corporate Web sites. It achieves state-of-the-art results on simple data structures and succeeds on complex data structures where previous approaches fail. Experiments also show that our interactive framework significantly reduces the number of user interactions needed to build a wrapper.
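
    The induction algorithm itself is not given in the abstract, but the incremental idea can be illustrated: tuples of length i are produced by extending already-extracted tuples of length i − 1, so each extraction step can condition on the partial tuple's context. A toy Python sketch on an invented page structure follows.

```python
# Incremental tuple extraction: length-1 tuples first, then each one is
# extended with a second field found in its own subtree. Illustrative only.
import xml.etree.ElementTree as ET

page = ET.fromstring("""<body>
  <div class="company"><h2>Acme</h2>
    <ul><li>Anvils</li><li>Rockets</li></ul></div>
  <div class="company"><h2>Globex</h2>
    <ul><li>Widgets</li></ul></div>
</body>""")

# Step 1: unary partial tuples (company,), keeping each node as context.
partials = [(div, (div.find("h2").text,))
            for div in page.iter("div") if div.get("class") == "company"]

# Step 2: extend each length-1 tuple using its own node's subtree.
tuples = [t + (li.text,) for div, t in partials for li in div.iter("li")]
print(tuples)
# [('Acme', 'Anvils'), ('Acme', 'Rockets'), ('Globex', 'Widgets')]
```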