329 research outputs found

    Discovery Agent: An Interactive Approach for the Discovery of Inclusion Dependencies

    The information integration problem is a hard yet important problem in the field of databases. The goal of information integration is to provide unified views over diverse data spread across several sources. The subject has been studied for a long time, and the integration can be performed in several ways; schema integration using inclusion dependency constraints is one of them. The problem of discovering inclusion dependencies among input relations is NP-complete in the number of attributes. Two significant algorithms address this problem: FIND2 by Andreas Koeller and Zigzag by Fabien De Marchi. Both algorithms discover inclusion dependencies among input relations in small-scale databases with relatively few attributes; because of discrepancies in the data, they do not scale well to higher numbers of attributes. We propose an approach that incorporates human intelligence into the algorithmic discovery of inclusion dependencies. To use human intelligence, we design an agent, called the discovery agent, that provides a communication bridge between an algorithm and a user. The discovery agent shows the progress of the discovery process and provides sufficient user controls to steer the discovery process in the right direction. In this thesis, we present a prototype of the discovery agent based upon the FIND2 algorithm, which exploits most of the phase-wise behavior of the algorithm, and we demonstrate how a human observer and the algorithm work together to achieve higher performance and better output accuracy. The goal of the discovery agent is to make the discovery process truly interactive between system and user and to produce the desired, accurate result. With a suitable algorithm and appropriate human expertise, the discovery agent can deliver an applicable and feasible approximation for this NP-complete problem.
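
    As a minimal illustration of the basic check underlying this line of work (a sketch, not the FIND2 or Zigzag algorithms themselves): a unary inclusion dependency R.A ⊆ S.B holds when every non-null value of column A in R also appears in column B of S. The table and column names below are hypothetical.

```python
# Minimal sketch: testing a candidate unary inclusion dependency R.A <= S.B.
# The dependency holds when every non-null value of R.A also occurs in S.B.
# Table and column names are hypothetical.

def holds_ind(r_rows, a, s_rows, b):
    """Return True if the values of column `a` in r_rows are a subset
    of the values of column `b` in s_rows (nulls ignored)."""
    lhs = {row[a] for row in r_rows if row[a] is not None}
    rhs = {row[b] for row in s_rows if row[b] is not None}
    return lhs <= rhs

# Example: every customer ID referenced by an order should exist in customers.
orders = [{"customer_id": 1}, {"customer_id": 2}]
customers = [{"id": 1}, {"id": 2}, {"id": 3}]
print(holds_ind(orders, "customer_id", customers, "id"))  # True
```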

    Integration of Heterogeneous Databases: Discovery of Meta-Information and Maintenance of Schema-Restructuring Views

    In today's networked world, information is widely distributed across many independent databases in heterogeneous formats. Integrating such information is a difficult task and has been addressed by several projects. However, previous integration solutions, such as the EVE-Project, have several shortcomings. Database contents and structure change frequently, and users often have incomplete information about the data content and structure of the databases they use. When information from several such insufficiently described sources is to be extracted and integrated, two problems have to be solved: How can we discover the structure and contents of, and the interrelationships among, unknown databases, and how can we provide durable integration views over several such databases? In this dissertation, we have developed solutions for those key problems in information integration. The first part of the dissertation addresses the fact that knowledge about the interrelationships between databases is essential for any attempt at solving the information integration problem. We present an algorithm called FIND2, based on the clique-finding problem in graphs and k-uniform hypergraphs, to discover redundancy relationships between two relations. Furthermore, the algorithm is enhanced by heuristics that significantly reduce the search space when necessary. Extensive experimental studies of the algorithm, both with and without heuristics, illustrate its effectiveness on a variety of real-world data sets. The second part of the dissertation addresses the durable-view problem and presents the first algorithm for incremental view maintenance in schema-restructuring views. Such views are essential for the integration of heterogeneous databases. They are typically defined in schema-restructuring query languages like SchemaSQL, which can transform schema into data and vice versa, making traditional view maintenance based on differential queries impossible. Based on an existing algebra for SchemaSQL, we present an update propagation algorithm that propagates updates along the query algebra tree, and we prove its correctness. We also propose optimizations of our algorithm and present experimental results showing its benefits over view recomputation.
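
    To give a feel for the clique-based idea mentioned above (a simplified sketch, not the published FIND2 algorithm): validated unary inclusion dependencies become graph nodes, validated binary ones become edges, and cliques in that graph suggest higher-arity candidates that must still be checked against the data. The attribute pairs below are hypothetical.

```python
# Simplified sketch of clique-based candidate generation: nodes are valid
# unary INDs, edges are valid binary INDs, and each clique of size k
# suggests a k-ary IND candidate that still needs validation on the data.
# The attribute pairs are hypothetical.
import networkx as nx

unary_inds = ["A1<=B1", "A2<=B2", "A3<=B3", "A4<=B4"]
binary_inds = [("A1<=B1", "A2<=B2"), ("A1<=B1", "A3<=B3"),
               ("A2<=B2", "A3<=B3")]  # pairs involving A4 failed validation

g = nx.Graph()
g.add_nodes_from(unary_inds)
g.add_edges_from(binary_inds)

# Maximal cliques correspond to maximal higher-arity IND candidates.
for clique in nx.find_cliques(g):
    if len(clique) >= 3:
        print("candidate 3-ary IND from:", clique)
```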

    Profiling relational data: a survey

    Profiling data to determine metadata about a given dataset is an important and frequent activity of any IT professional and researcher, and it is necessary for various use cases. It encompasses a vast array of methods to examine datasets and produce metadata. Among the simpler results are statistics, such as the number of null values and distinct values in a column, its data type, or the most frequent patterns of its data values. Metadata that are more difficult to compute involve multiple columns, namely correlations, unique column combinations, functional dependencies, and inclusion dependencies. Further techniques detect conditional properties of the dataset at hand. This survey provides a classification of data profiling tasks and comprehensively reviews the state of the art for each class. In addition, we review data profiling tools and systems from research and industry. We conclude with an outlook on the future of data profiling beyond traditional profiling tasks and beyond relational databases.
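
    The simpler single-column statistics named above are easy to sketch. The following hypothetical example, assuming tabular data held in a pandas DataFrame, computes null counts, distinct counts, and inferred types per column.

```python
# Sketch of simple single-column profiling statistics with pandas:
# null count, distinct count, and inferred data type per column.
# The sample data is hypothetical.
import pandas as pd

df = pd.DataFrame({
    "name": ["alice", "bob", None, "alice"],
    "age": [34, 28, 41, 34],
})

profile = pd.DataFrame({
    "nulls": df.isna().sum(),        # null values per column
    "distinct": df.nunique(),        # distinct values per column
    "dtype": df.dtypes.astype(str),  # inferred data type
})
print(profile)
```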

    Navigating Diverse Datasets in the Face of Uncertainty

    When exploring big volumes of data, one of the challenging aspects is their diversity of origin. Multiple files that have not yet been ingested into a database system may contain information of interest to a researcher, who must curate, understand and sieve their content before being able to extract knowledge. Performance is one of the greatest difficulties in exploring these datasets. On the one hand, examining non-indexed, unprocessed files can be inefficient. On the other hand, any processing done before the data is understood introduces latency and potentially unnecessary work if the chosen schema poorly matches the data. We have surveyed the state of the art and, fortunately, there exist multiple proposed solutions for handling data in situ efficiently. Another major difficulty is matching files from multiple origins, since their schema and layout may not be compatible or properly documented. Most surveyed solutions overlook this problem, especially for numeric, uncertain data, as is typical in fields like astronomy. The main objective of our research is to assist data scientists during the exploration of unprocessed, numerical, raw data distributed across multiple files, based solely on its intrinsic distribution. In this thesis, we first introduce the concept of Equally-Distributed Dependencies (EDDs), which provides the foundation to match this kind of dataset. We propose PresQ, a novel algorithm that finds quasi-cliques in hypergraphs based on their expected statistical properties. The probabilistic approach of PresQ can be successfully exploited to mine EDDs between diverse datasets when the underlying populations can be assumed to be the same. Finally, we propose a two-sample statistical test based on Self-Organizing Maps (SOM). This method can outperform, in terms of power, other classifier-based two-sample tests, being in some cases comparable to kernel-based methods, with the advantage of being interpretable. Both PresQ and the SOM-based statistical test can provide insights that drive serendipitous discoveries.
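
    As a rough sketch of the classifier-based two-sample tests the SOM method is compared against (the SOM test itself is more involved and not reproduced here): label the two samples 0 and 1, train a classifier to tell them apart, and treat held-out accuracy well above chance as evidence that the distributions differ. The data and model choices below are hypothetical.

```python
# Sketch of a classifier-based two-sample test (the baseline family the
# SOM-based test is compared against, not the SOM test itself): if a
# classifier separates the two samples better than chance, the underlying
# distributions likely differ. Data and model are hypothetical.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
sample_a = rng.normal(0.0, 1.0, size=(500, 2))  # hypothetical dataset A
sample_b = rng.normal(0.3, 1.0, size=(500, 2))  # hypothetical dataset B

x = np.vstack([sample_a, sample_b])
y = np.array([0] * len(sample_a) + [1] * len(sample_b))

acc = cross_val_score(LogisticRegression(), x, y, cv=5).mean()
print(f"held-out accuracy: {acc:.2f}  (near 0.5 suggests same distribution)")
```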

    Transforming semi-structured life science diagrams into meaningful domain ontologies with DiDOn

    Bio-ontology development is a resource-consuming task despite the many open-source ontologies available for reuse. Various strategies and tools for bottom-up ontology development have been proposed from a computing angle, yet the most obvious one from a domain expert's perspective is unexplored: the abundant diagrams in the sciences. To speed up and simplify bio-ontology development, we propose a detailed, micro-level procedure, DiDOn, to formalise such semi-structured biological diagrams, availing also of a foundational ontology for more precise and interoperable subject domain semantics. The approach is illustrated using Pathway Studio as a case study.
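
    To make the goal concrete (a hypothetical illustration, not DiDOn itself): formalising a single diagram arrow such as "ProteinA activates ProcessB" could yield an OWL axiom. The sketch below uses the owlready2 library; all class, property, and ontology names are made up.

```python
# Hypothetical sketch of what formalising one diagram element might yield:
# the arrow "ProteinA --activates--> ProcessB" becomes an OWL axiom.
# Uses owlready2; all names and the ontology IRI are invented.
from owlready2 import get_ontology, Thing, ObjectProperty

onto = get_ontology("http://example.org/pathway.owl")
with onto:
    class Protein(Thing): pass
    class Process(Thing): pass

    class activates(ObjectProperty):
        domain = [Protein]
        range = [Process]

    class ProteinA(Protein): pass
    class ProcessB(Process): pass

    # "Every ProteinA activates some ProcessB" as a subclass axiom.
    ProteinA.is_a.append(activates.some(ProcessB))
```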

    Expressivity Within Second-Order Transitive-Closure Logic

    Second-order transitive-closure logic, SO(TC), is an expressive declarative language that captures the complexity class PSPACE. Already its monadic fragment, MSO(TC), allows the expression of various NP-hard and even PSPACE-hard problems in a natural and elegant manner. As SO(TC) offers an attractive framework for expressing properties in terms of declaratively specified computations, it is interesting to understand the expressivity of different features of the language. This paper focuses on the fragment MSO(TC), as well as on the purely existential fragment SO(2TC)(∃); in 2TC, the TC operator binds only tuples of relation variables. We establish that, with respect to expressive power, SO(2TC)(∃) collapses to existential first-order logic. In addition, we study the relationship of MSO(TC) to an extension of MSO(TC) with counting features (CMSO(TC)) as well as to order-invariant MSO. We show that the expressive powers of CMSO(TC) and MSO(TC) coincide. Moreover, we establish that, over unary vocabularies, MSO(TC) strictly subsumes order-invariant MSO.
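
    For readers unfamiliar with the TC construct, a standard textbook illustration (not taken from this paper, and using the first-order form of the operator) is graph connectivity: once TC is available, reachability along the edge relation becomes directly expressible.

```latex
% Standard textbook illustration of a TC operator (not from the paper):
% a graph G = (V, E) is connected iff every vertex is reachable from
% every other one via the transitive closure of the edge relation E.
\[
  \varphi_{\mathrm{conn}} \;\equiv\;
  \forall u \, \forall v \; \bigl[\mathrm{TC}_{x,y}\, E(x,y)\bigr](u,v)
\]
```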

    Data Doctor: An Efficient Data Profiling and Quality Improvement Tool

    Many business and IT managers face the same problem: the data that serves as the foundation for their business applications is inconsistent, inaccurate, and unreliable. Data profiling is the solution to this problem and, as such, is a fundamental step that should begin every data-driven initiative. In this paper we implement data profiling techniques such as Column Analysis, Frequency Analysis, Null Rule Analysis, Constant Analysis, Empty Column Analysis, and Unique Analysis. DOI: 10.17762/ijritcc2321-8169.160411
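
    Several of the listed analyses reduce to simple per-column checks. The hypothetical pandas sketch below covers three of them, Null Rule Analysis, Constant Analysis, and Unique Analysis; the sample data is invented.

```python
# Sketch of three of the per-column checks named above, using pandas:
# null rule analysis (share of nulls), constant analysis (only one value),
# and unique analysis (no duplicate values). The sample data is hypothetical.
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "status": ["ok", "ok", "ok", "ok"],
    "email": ["a@x.com", None, "c@x.com", None],
})

for col in df.columns:
    series = df[col]
    print(col,
          f"null_ratio={series.isna().mean():.2f}",
          f"constant={series.nunique(dropna=True) == 1}",
          f"unique={series.is_unique}")
```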