99 research outputs found

    Optimizing Federated Queries Based on the Physical Design of a Data Lake

    The optimization of query execution plans is known to be crucial for reducing query execution time. In particular, query optimization has been studied thoroughly for relational databases over the past decades. Recently, the Resource Description Framework (RDF) became popular for publishing data on the Web. As a consequence, federations composed of different data models, such as RDF and relational databases, have evolved. One type of such federation is the Semantic Data Lake, where every data source is kept in its original data model and semantically annotated with ontologies or controlled vocabularies. However, state-of-the-art query engines for federated query processing over Semantic Data Lakes often rely on optimization techniques tailored for RDF. In this paper, we present query optimization techniques guided by heuristics that take the physical design of a Data Lake into account. The heuristics are implemented on top of Ontario, a SPARQL query engine for Semantic Data Lakes. Using source-specific heuristics, the query engine is able to generate more efficient query execution plans by exploiting knowledge about indexes and normalization in relational databases. We show that heuristics which take the physical design of the Data Lake into account are able to speed up query processing.
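    To make the idea concrete, here is a minimal sketch of a source-specific join-ordering heuristic; the data structures, the `has_index` flag, and the cardinality estimates are hypothetical illustrations, not Ontario's actual API:

```python
# Hypothetical sketch of a source-aware join-ordering heuristic:
# prefer probing relational sources whose join attribute is indexed,
# so the RDBMS can exploit its physical design.

from dataclasses import dataclass

@dataclass
class SubQuery:
    source: str           # e.g. "rdf" or "relational"
    join_attr: str        # attribute the subquery joins on
    has_index: bool       # physical-design knowledge: is join_attr indexed?
    est_cardinality: int  # estimated result size

def order_joins(subqueries: list[SubQuery]) -> list[SubQuery]:
    """Order subqueries so indexed relational sources are probed first.

    Heuristic: (1) favor sources with an index on the join attribute,
    (2) break ties by smaller estimated cardinality.
    """
    return sorted(
        subqueries,
        key=lambda sq: (not (sq.source == "relational" and sq.has_index),
                        sq.est_cardinality),
    )

plan = order_joins([
    SubQuery("rdf", "patient_id", False, 50_000),
    SubQuery("relational", "patient_id", True, 200_000),
])
print([sq.source for sq in plan])  # relational source is probed first
```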

    Learning To Scale Up Search-Driven Data Integration

    A recent movement to tackle the long-standing data integration problem is a compositional and iterative approach, termed “pay-as-you-go” data integration. Under this model, the objective is to immediately support queries over “partly integrated” data, and to enable the user community to drive integration of the data that relate to their actual information needs. Over time, data will be gradually integrated. While the pay-as-you-go vision has been well-articulated for some time, only recently have we begun to understand how it can be manifested in a system implementation. One branch of this effort has focused on enabling queries through keyword search-driven data integration, in which users pose queries over partly integrated data encoded as a graph, receive ranked answers generated from data and metadata that are linked at query time, and provide feedback on those answers. From this user feedback, the system learns to repair bad schema matches or record links. Many real-world issues of uncertainty and diversity in search-driven integration remain open; such tasks require a combination of human guidance and machine learning, and the challenge is how to make maximal use of limited human input. This thesis develops three methods to scale up search-driven integration by learning from expert feedback: (1) active learning techniques to repair links from small amounts of user feedback; (2) collaborative learning techniques to combine users’ conflicting feedback; and (3) debugging techniques to identify where data experts could best improve integration quality. We implement these methods within the Q System, a prototype of search-driven integration, and validate their effectiveness over real-world datasets.
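    To give a flavor of the active-learning component, the following is a generic uncertainty-sampling loop for link repair, not the Q System's actual code; the features and the simulated expert are placeholders:

```python
# Generic uncertainty-sampling loop for repairing record links:
# repeatedly ask the expert about the candidate link whose predicted
# match probability is closest to 0.5 (the model's least certain case).

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_pool = rng.random((200, 3))                      # similarity features per candidate link
y_pool = (X_pool.mean(axis=1) > 0.5).astype(int)   # stand-in for expert judgments

# Seed the model with one labeled example per class.
asked = {int(np.argmax(y_pool == 0)), int(np.argmax(y_pool == 1))}
model = LogisticRegression()

for _ in range(10):                                # small labeling budget
    idx = sorted(asked)
    model.fit(X_pool[idx], y_pool[idx])
    probs = model.predict_proba(X_pool)[:, 1]
    order = np.argsort(np.abs(probs - 0.5))        # most uncertain first
    i = next(int(j) for j in order if int(j) not in asked)
    asked.add(i)                                   # "ask the expert" about this link
```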

    Design and Implementation of a Data Storage System for Privacy Policies Based on ETL Techniques

    The information integration problem arises from the wide dispersion of information across different storage systems. In this project, the problem has been solved for privacy policy datasets coming from different sources, using techniques for extracting, loading, and transforming the information into a centralized storage system.
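    A minimal sketch of such an extract-load-transform pipeline; the file names, field names, and the SQLite target are illustrative assumptions, not the system described:

```python
# Minimal ELT sketch: extract privacy-policy records from heterogeneous
# sources, then load and transform them into a unified, centralized schema.
# All file names, field names, and the SQLite target are illustrative.

import csv
import json
import sqlite3

def extract():
    """Pull records from two hypothetical source formats."""
    with open("policies_site_a.json", encoding="utf-8") as f:
        yield from ({"site": r["domain"], "text": r["policy"]} for r in json.load(f))
    with open("policies_site_b.csv", encoding="utf-8") as f:
        yield from ({"site": r["url"], "text": r["body"]} for r in csv.DictReader(f))

def load_and_transform(records, db="policies.db"):
    con = sqlite3.connect(db)
    con.execute("CREATE TABLE IF NOT EXISTS policy (site TEXT, text TEXT)")
    # Transform: lowercase the site key and normalize whitespace in the text.
    con.executemany(
        "INSERT INTO policy VALUES (?, ?)",
        ((r["site"].lower(), " ".join(r["text"].split())) for r in records),
    )
    con.commit()
    con.close()

load_and_transform(extract())
```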

    Linked Data and Ontologies in a Web-Based Graphical Tool

    This line of research is being carried out collaboratively by teacher-researchers from the Universidad Nacional del Comahue and the Universidad Nacional del Sur, within the framework of research projects funded by both universities. The overall goal of the research is to enable interaction between Linked Data sources available on the Web and crowd, a client-server tool for graphical conceptual modeling with reasoning support. In this way, it should be possible to navigate any ontology associated with the data and observe its relationships graphically, the latter in order to make interpretation easier for the average user.
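    For flavor, the sketch below pulls subclass relations from a public SPARQL endpoint, the kind of ontology structure such a tool could render graphically; the endpoint, query, and the SPARQLWrapper library are illustrative choices, not crowd's implementation:

```python
# Illustrative: fetch subclass relations from a public SPARQL endpoint,
# the kind of ontology structure a graphical tool could draw as edges.
# Endpoint and query are examples only, not crowd's actual code.

from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setQuery("""
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?sub ?sup WHERE {
        ?sub rdfs:subClassOf ?sup .
        FILTER(STRSTARTS(STR(?sub), "http://dbpedia.org/ontology/"))
    } LIMIT 20
""")
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["sub"]["value"], "->", row["sup"]["value"])  # edges to draw
```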

    m-tables: Representing Missing Data

    Representation systems have been widely used to capture different forms of incomplete data in various settings. However, existing representation systems are not expressive enough to handle the more complex scenarios of missing data that can occur in practice: these range from missing attribute values, to missing a known number of tuples, to missing an unknown number of tuples. In this work, we propose a new representation system called m-tables that can represent many different types of missing data. We show that m-tables form a closed, complete, and strong representation system under both set and bag semantics, and are strictly more expressive than conditional tables under both the closed and open world assumptions. We further study the complexity of computing certain and possible answers in m-tables. Finally, we discuss how to "interpret" m-tables through a novel labeling scheme that marks types of generalized tuples as certain or possible.
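    The underlying intuition behind certain and possible answers can be shown with a toy example over tuples with missing attribute values; this is not the m-tables formalism itself, only the semantics it generalizes:

```python
# Toy illustration of certain vs. possible answers for a boolean query
# "some row satisfies pred" over tuples with missing attribute values.
# This is NOT the m-tables formalism, only the intuition it builds on:
# certain  = true in every completion of the missing values,
# possible = true in at least one completion.

MISSING = None
rows = [("alice", 30), ("bob", MISSING)]  # (name, age); age may be missing

def certain_exists(pred, rows):
    """Holds in every completion: only known values can guarantee it."""
    return any(v is not MISSING and pred(v) for _, v in rows)

def possible_exists(pred, rows):
    """Holds in some completion: a missing value could be chosen to
    satisfy pred (assuming pred is satisfiable over the domain)."""
    return any(v is MISSING or pred(v) for _, v in rows)

over_40 = lambda age: age > 40
print(certain_exists(over_40, rows))   # False: no known age exceeds 40
print(possible_exists(over_40, rows))  # True: bob's missing age could be 50
```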

    Emergent semantics in distributed knowledge management

    Organizations and enterprises have developed complex data and information exchange systems that are now vital for their daily operations. Currently available systems, however, face a major challenge. On today's global information infrastructure, data semantics is more and more context- and time-dependent, and cannot be fixed once and for all at design time. Identifying emerging relationships among previously unrelated information items (e.g., during data interchange) may dramatically increase their business value. This chapter introduces and discusses the notion of Emergent Semantics (ES), where both the representation of semantics and the discovery of the proper interpretation of symbols are seen as the result of a self-organizing process performed by distributed agents, exchanging symbols and adaptively developing the proper interpretation via multi-party cooperation and conflict resolution. Emergent data semantics is dynamically dependent on the collective behaviour of large communities of agents, which may have different and even conflicting interests and agendas. This is a research paradigm that interprets semantics from a pragmatic perspective. The chapter introduces the notion and provides a discussion of the principles, research areas, and current state of the art.

    A Review of Accessing Big Data with Significant Ontologies

    Ontology Based Data Access (OBDA) is a recently proposed approach that provides a conceptual view over relational data sources. It addresses the problem of direct access to big data by providing end-users with an ontology that mediates between users and sources, where the ontology is connected to the data via mappings. We introduce the languages used to represent the ontologies and the mapping assertion techniques from which query answering over the sources is derived. Query answering is divided into two steps: (i) ontology rewriting, in which the query is rewritten with respect to the ontology into a new query; and (ii) mapping rewriting, in which the query obtained in the previous step is reformulated over the data sources using the mapping assertions. In this survey, we study earlier work by other researchers in the fields of ontologies, mappings, and query answering over data sources.
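    The two rewriting steps can be sketched on a toy example; the subclass axiom, mapping assertions, and SQL below are illustrative placeholders, and real OBDA rewritings handle full conjunctive queries rather than single classes:

```python
# Toy illustration of the two OBDA query-answering steps.
# Step 1 (ontology rewriting): expand the query with the subclasses
# the ontology implies. Step 2 (mapping rewriting): replace each class
# with the SQL its mapping assertion defines over the sources.
# The axiom, mappings, and SQL are illustrative placeholders.

subclass_of = {"Professor": "Employee"}  # ontology axiom: Professor is a subclass of Employee

mappings = {  # mapping assertions: class -> SQL over the sources
    "Employee":  "SELECT id FROM staff",
    "Professor": "SELECT id FROM faculty WHERE rank = 'professor'",
}

def ontology_rewrite(cls):
    """All classes whose instances answer a query over `cls`."""
    return [cls] + [sub for sub, sup in subclass_of.items() if sup == cls]

def mapping_rewrite(classes):
    """Union of the SQL fragments the mappings assign to each class."""
    return "\nUNION\n".join(mappings[c] for c in classes)

# Query: "all instances of Employee" becomes a UNION over both sources.
print(mapping_rewrite(ontology_rewrite("Employee")))
```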