    Conceptual Workflow for Complex Data Integration using AXML

    Relevant data for decision support systems are available everywhere and in various formats. Such data must be integrated into a unified format. Traditional data integration approaches are not suited to complex data. We therefore exploit the Active XML (AXML) language to integrate complex data. Its XML part allows complex data to be unified, modeled and stored, while its services part addresses the distribution of data sources. Accordingly, the different integration tasks are proposed as services. These services are managed via a set of active rules that are built upon the metadata and events of the integration system. In this paper, we design an architecture for integrating complex data autonomously. We have also designed the workflow for the data integration tasks.
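
    The idea of active rules driving integration services can be sketched as a small event-condition-action dispatcher. This is only an illustrative sketch, not the architecture described in the paper: the event names, metadata keys and service functions below are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Rule:
    event: str                          # event name that triggers the rule
    condition: Callable[[Dict], bool]   # predicate over source metadata
    action: Callable[[Dict], str]       # integration service to invoke


# Hypothetical integration services; each returns a label for the call it would make.
def extract_service(meta: Dict) -> str:
    return f"extract:{meta['source']}"


def transform_service(meta: Dict) -> str:
    return f"transform-to-xml:{meta['source']}"


# Hypothetical rule base built on events and metadata.
RULES: List[Rule] = [
    Rule("source_added", lambda m: True, extract_service),
    Rule("data_extracted", lambda m: m["format"] != "xml", transform_service),
]


def dispatch(event: str, meta: Dict) -> List[str]:
    """Run every rule whose event matches and whose condition holds."""
    return [r.action(meta) for r in RULES if r.event == event and r.condition(meta)]


calls = dispatch("data_extracted", {"source": "images.csv", "format": "csv"})
```

    In this sketch, data that is already in XML triggers no transformation service, because the rule's condition on the `format` metadata fails.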

    K2/Kleisli and GUS: Experiments in Integrated Access to Genomic Data Sources

    The integration of heterogeneous data sources and software systems is a major issue in the biomedical community and several approaches have been explored: linking databases, on-the-fly integration through views, and integration through warehousing. In this paper we report on our experiences with two systems that were developed at the University of Pennsylvania: an integration system called K2, which has primarily been used to provide views over multiple external data sources and software systems; and a data warehouse called GUS, which downloads, cleans, integrates and annotates data from multiple external data sources. Although the view and warehouse approaches each have their advantages, there is no clear winner. Therefore, users must consider how the data is to be used, what the performance guarantees must be, and how much programmer time and expertise is available to choose the best strategy for a particular application.

    Making Linked Open Data Writable with Provenance Semirings

    The Linked Open Data (LOD) cloud is essentially read-only, restraining the possibility of collaborative knowledge construction. To support collaboration, we need to make the LOD writable. In this paper, we propose a vision for a writable linked data where each LOD participant can define updatable materialized views from data hosted by other participants. Consequently, building a writable LOD can be reduced to the problem of SPARQL self-maintenance of Select-Union recursive materialized views. We propose TM-Graph, an RDF-Graph annotated with elements of a specialized provenance semiring to maintain consistency of these views, and we analyze complexity in space and traffic.
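
    To illustrate how provenance annotations can keep a Select-Union view consistent, here is a minimal sketch. It does not reproduce TM-Graph or its specialized semiring; instead it assumes a simpler annotation, a multiset of contributing participants, where union adds annotations and selection leaves them unchanged. All function and participant names are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

Triple = Tuple[str, str, str]
Annotated = Dict[Triple, Counter]  # triple -> provenance (participant -> multiplicity)


def annotate(triples: Iterable[Triple], participant: str) -> Annotated:
    """Tag each base triple with the participant that hosts it."""
    return {t: Counter({participant: 1}) for t in triples}


def select(graph: Annotated, predicate: str) -> Annotated:
    """Selection keeps the annotation of every triple it retains."""
    return {t: Counter(p) for t, p in graph.items() if t[1] == predicate}


def union(a: Annotated, b: Annotated) -> Annotated:
    """Union adds annotations (the '+' of the annotation structure)."""
    out = {t: Counter(p) for t, p in a.items()}
    for t, p in b.items():
        out.setdefault(t, Counter()).update(p)
    return out


def retract(view: Annotated, participant: str) -> Annotated:
    """Drop one participant's contributions; a triple survives if others still derive it."""
    out = {}
    for t, p in view.items():
        rest = Counter({k: v for k, v in p.items() if k != participant})
        if rest:
            out[t] = rest
    return out


fact = ("ex:Paris", "ex:locatedIn", "ex:France")
alice = annotate([fact, ("ex:Paris", "rdfs:label", "Paris")], "alice")
bob = annotate([fact], "bob")

view = union(select(alice, "ex:locatedIn"), select(bob, "ex:locatedIn"))
after_alice_leaves = retract(view, "alice")
```

    Because the view records that both participants contributed the same triple, retracting one contribution can be handled from the view's own annotations, without re-querying the sources.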

    Estocada: Stockage Hybride et Ré-écriture sous Contraintes d'Intégrité (Estocada: Hybrid Storage and Rewriting under Integrity Constraints)

    The growing production of digital data has led to the emergence of a wide variety of data management systems (DMSs). In this context, data-intensive applications need (i) to access large heterogeneous data ("Big Data") with a potentially complex structure, and (ii) to manipulate data efficiently in order to guarantee good application performance. Because these systems are each specialized for certain operations but perform less well on others, it can be essential for an application to use several DMSs at the same time. In this context we present Estocada, an application that makes it possible to take advantage of several DMSs simultaneously and enables efficient, automatic manipulation of large heterogeneous data, thereby offering better support for data-intensive applications. In Estocada, data is split into several fragments that are stored in different DMSs. To answer a query from these fragments, Estocada relies on query rewriting under constraints; the constraints are used to represent the different data models and the distribution of the fragments among the different DMSs.

    Ontop: answering SPARQL queries over relational databases

    We present Ontop, an open-source Ontology-Based Data Access (OBDA) system that allows for querying relational data sources through a conceptual representation of the domain of interest, provided in terms of an ontology, to which the data sources are mapped. Key features of Ontop are its solid theoretical foundations, a virtual approach to OBDA, which avoids materializing triples and is implemented through the query rewriting technique, extensive optimizations exploiting all elements of the OBDA architecture, its compliance with all relevant W3C recommendations (including SPARQL queries, R2RML mappings, and OWL 2 QL and RDFS ontologies), and its support for all major relational databases.
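
    The virtual approach can be illustrated with a toy rewriting step: a query over an ontology class is turned into SQL over the mapped tables, so no triples are materialized. This sketch does not use Ontop's API or R2RML syntax; the mappings, the subclass axiom, the table names and `rewrite_class_query` are all hypothetical, and only direct subclasses are considered.

```python
import sqlite3

# Hypothetical mappings: ontology class -> SQL that produces its instance IRIs.
CLASS_MAPPINGS = {
    ":Employee": "SELECT 'emp/' || id AS s FROM staff",
    ":Manager": "SELECT 'emp/' || id AS s FROM managers",
}

# Hypothetical ontology axiom: every :Manager is an :Employee.
SUBCLASS_OF = {":Manager": ":Employee"}


def rewrite_class_query(cls: str) -> str:
    """Rewrite 'find all instances of cls' into one SQL query.

    The ontology is compiled into the query: instances of direct
    subclasses are included through a UNION over their mappings.
    """
    classes = [cls] + [sub for sub, sup in SUBCLASS_OF.items() if sup == cls]
    return " UNION ".join(CLASS_MAPPINGS[c] for c in classes)


# A small in-memory relational source to run the rewritten query against.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE staff(id INTEGER, full_name TEXT);
    CREATE TABLE managers(id INTEGER, full_name TEXT);
    INSERT INTO staff VALUES (1, 'Ada');
    INSERT INTO managers VALUES (2, 'Grace');
    """
)

sql = rewrite_class_query(":Employee")
employees = sorted(row[0] for row in conn.execute(sql))
```

    The manager stored only in the `managers` table is returned as an employee because the subclass axiom is folded into the SQL, which is the kind of reasoning a rewriting-based OBDA system performs at query time.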

    Optimized Seamless Integration of Biomolecular Data

    Today, scientific data is inevitably digitized, stored in a wide variety of heterogeneous formats, and accessible over the Internet. Scientists need to access an integrated view of multiple remote or local heterogeneous data sources. They then integrate the results of complex queries and apply further analysis and visualization to support the task of scientific discovery. Building such a digital library for scientific discovery requires accessing and manipulating data extracted from flat files or databases, documents retrieved from the Web, as well as data that is locally materialized in warehouses or is generated by software. We consider several tasks to provide optimized and seamless integration of biomolecular data. Challenges to be addressed include capturing and representing source capabilities; developing a methodology to acquire and represent semantic knowledge and metadata about source contents, overlap in source contents, and access costs; and providing decision support to select sources and capabilities using cost-based and semantic knowledge and to generate low-cost query evaluation plans. (Also referenced as UMIACS-TR-2001-51)

    Applying the UML and the Unified Process to the Design of Data Warehouses

    The design, development and deployment of a data warehouse (DW) is a complex, time-consuming and failure-prone task. This is mainly due to the different aspects involved in a DW architecture, such as data sources, the processes responsible for Extracting, Transforming and Loading (ETL) data into the DW, the modeling of the DW itself, the specification of data marts from the data warehouse, and the design of end-user tools. In recent years, different models, methods and techniques have been proposed to provide partial solutions covering the different aspects of a data warehouse. Nevertheless, none of these proposals addresses the whole development process of a data warehouse in an integrated and coherent manner, providing the same notation for the modeling of the different parts of a DW. In this paper, we propose a data warehouse development method, based on the Unified Modeling Language (UML) and the Unified Process (UP), which addresses the design and development of both the data warehouse back-stage and front-end. We use the extension mechanisms (stereotypes, tagged values and constraints) provided by the UML and we properly extend it in order to accurately model the different parts of a data warehouse (such as the modeling of the data sources, ETL processes or the modeling of the DW itself) by using the same notation. To the best of our knowledge, our proposal provides a seamless method for developing data warehouses. Finally, we apply our approach to a case study to show its benefit. This work has been partially supported by the METASIGN project (TIN2004-00779) from the Spanish Ministry of Education and Science, by the DADASMECA project (GV05/220) from the Valencia Government, and by the DADS (PBC-05-012-2) project from the Regional Science and Technology Ministry of Castilla-La Mancha (Spain).

    Reasoning with Data Flows and Policy Propagation Rules

    Data-oriented systems and applications are at the centre of current developments of the World Wide Web. In these scenarios, assessing what policies propagate from the licenses of data sources to the output of a given data-intensive system is an important problem. Both policies and data flows can be described with Semantic Web languages. Although it is possible to define Policy Propagation Rules (PPR) by associating policies with data flow steps, this activity results in a huge number of rules to be stored and managed. In a recent paper, we introduced strategies for reducing the size of a PPR knowledge base by using an ontology of the possible relations between data objects, the Datanode ontology, and applying the (A)AAAA methodology, a knowledge engineering approach that exploits Formal Concept Analysis (FCA). In this article, we investigate whether this reasoning is feasible and how it can be performed. For this purpose, we study the impact of compressing a rule base associated with an inference mechanism on the performance of the reasoning process. Moreover, we report on an extension of the (A)AAAA methodology that includes a coherency check algorithm, which makes this reasoning possible. We show how this compression, in addition to being beneficial to the management of the knowledge base, also has a positive impact on the performance and resource requirements of the reasoning process for policy propagation.
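
    The effect of compressing a rule base with a relation hierarchy can be sketched as follows. This is not the (A)AAAA methodology or the actual Datanode vocabulary: the relation names, policy names and hierarchy are hypothetical, and a rule stated on a relation is simply assumed to hold for all of its sub-relations.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical relation hierarchy between data objects: child -> parent.
PARENT: Dict[str, str] = {
    "hasCopy": "hasDerivation",
    "hasSelection": "hasDerivation",
    "hasDerivation": "relatedWith",
}

# Compressed rule base: (policy, relation) means the policy propagates
# over that relation and, by inheritance, over all of its sub-relations.
RULES: Set[Tuple[str, str]] = {
    ("duty:attribution", "hasDerivation"),
    ("prohibition:commercialUse", "hasCopy"),
}


def propagates(policy: str, relation: str) -> bool:
    """Walk up the hierarchy until a matching rule is found or the root is passed."""
    current = relation
    while current is not None:
        if (policy, current) in RULES:
            return True
        current = PARENT.get(current)
    return False


def output_policies(source_policies: Set[str], flow: List[str]) -> Set[str]:
    """Keep only the policies that propagate across every step of the data flow."""
    remaining = set(source_policies)
    for relation in flow:
        remaining = {p for p in remaining if propagates(p, relation)}
    return remaining


result = output_policies(
    {"duty:attribution", "prohibition:commercialUse"},
    ["hasCopy", "hasSelection"],
)
```

    Two rules cover three relations here; without the hierarchy, the attribution duty would need one rule per sub-relation, which is the growth in rule count that compression is meant to avoid.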

    A framework for integrating DNA sequenced data

    The Human Genome Project generated vast amounts of DNA sequenced data scattered in disparate data sources in a variety of formats. Integrating biological data and extracting information held in DNA sequences are major ongoing tasks for biologists and software professionals. This thesis explored issues of finding, extracting, merging and synthesizing information from multiple disparate data sources containing DNA sequenced data, which is composed of 3 billion chemical building blocks, or bases. We proposed a biological data integration framework based on typical usage patterns to simplify these issues for biologists. The framework uses a relational database management system at the backend, and provides techniques to extract, store, and manage the data. This framework was implemented, evaluated, and compared with existing biological data integration solutions.

    Ontology-based data integration in EPNet: Production and distribution of food during the Roman Empire

    Semantic technologies are rapidly changing historical research. Over the last decades, an immense amount of new quantifiable data has been accumulated and made available in interchangeable formats in the social sciences and humanities, opening up new possibilities for solving old questions and posing new ones. This paper introduces a framework that eases scholars' access to historical and cultural data about food production and the commercial trade system during the Roman Empire, distributed across different data sources. The proposed approach relies on the Ontology-Based Data Access (OBDA) paradigm, where the different datasets are virtually integrated by a conceptual layer (an ontology) that provides the user with a clear point of access and a unified and unambiguous conceptual view.