33 research outputs found

    Using schema transformation pathways for data lineage tracing

    Get PDF
    With the increasing amount and diversity of information available on the Internet, there has been a huge growth in information systems that need to integrate data from distributed, heterogeneous data sources. Tracing the lineage of the integrated data is one of the problems being addressed in data warehousing research. This paper presents a data lineage tracing approach based on schema transformation pathways. Our approach is not limited to one specific data model or query language, and would be useful in any data transformation/integration framework based on sequences of primitive schema transformations

    Data access and integration in the ISPIDER proteomics grid

    Get PDF
    Grid computing has great potential for supporting the integration of complex, fast changing biological data repositories to enable distributed data analysis. One scenario where Grid computing has such potential is provided by proteomics resources which are rapidly being developed with the emergence of affordable, reliable methods to study the proteome. The protein identifications arising from these methods derive from multiple repositories which need to be integrated to enable uniform access to them. A number of technologies exist which enable these resources to be accessed in a Grid environment, but the independent development of these resources means that significant data integration challenges, such as heterogeneity and schema evolution, have to be met. This paper presents an architecture which supports the combined use of Grid data access (OGSA-DAI), Grid distributed querying (OGSA-DQP) and data integration (AutoMed) software tools to support distributed data analysis. We discuss the application of this architecture for the integration of several autonomous proteomics data resources

    Intersection schemas as a dataspace integration technique

    Get PDF
    This paper introduces the concept of Intersection Schemas in the field of heterogeneous data integration and dataspaces. We introduce a technique for incrementally integrating heterogeneous data sources by specifying semantic overlaps between sets of extensional schemas using bidirectional schema transformations, and automatically combining them into a global schema at each iteration of the integration process. We propose an incremental data integration methodology that uses this technique and that aims to reduce the amount of up-front effort required. Such approaches to data integration are often described as pay-as-you-go. A demonstrator of our technique is described, which utilizes a new graphical user tool implemented using the AutoMed heterogeneous data integration system. A case study is also described, and our technique and integration methodology are compared with a classical data integration strategy

    A framework for an Integrated Mining of Heterogeneous data in decision support systems

    Get PDF
    The volume of information available on the Internet and corporate intranets continues to increase along with the corresponding increase in the data (structured and unstructured) stored by many organizations. Over the past years, data mining techniques have been used to explore large volume of data (structured) in order to discover knowledge, often in form of a decision support system. For effective decision making, there is need to discover knowledge from both structured and unstructured data for completeness and comprehensiveness. The aim of this paper is to present a framework to discover this kind of knowledge and to present a report on the work-in-progress on an on going research work. The proposed framework is composed of three basic phases: extraction and integration, data mining and finally the relevance of such a system to the business decision support system. In the first phase, both the structured and unstructured data are combined to form an XML database (combined data warehouse (CDW)). Efficiency is enhanced by clustering of unstructured data (documents) using SOM (Self Organized Maps) clustering algorithm, extracting keyphrases based on training and TF/IDF (Term Frequency/Inverse Document Frequency) by using the KEA (Keyphrases Extraction Algorithm) toolkit. In the second phase, association rule mining technique is applied to discover knowledge from the combined data warehouse. The final phase reflects the changes that such a system will bring about to the marketing decision support system. The paper also describes a developed system which evaluates the association rules mined from structured data that forms the first phase of the research work. The proposed system is expected to improve the quality of decisions, and this will be evaluated by using standard metrics for evaluating the interestingness of association rule which is based on statistical independence and correlation analysis

    A spatial mediator model for integrating heterogeneous spatial data

    Get PDF
    The complexity and richness of geospatial data create specific problems in heterogeneous data integration. To deal with this type of data integration, we propose a spatial mediator embedded in a large distributed mobile environment (GeoGrid). The spatial mediator takes a user request from a field application and uses the request to select the appropriate data sources, constructs subqueries for the selected data sources, defines the process of combining the results from the subqueries, and develop an integration script that controls the integration process in order to respond to the request. The spatial mediator uses ontologies to support search for both geographic location based on symbolic terms as well as providing a term-based index to spatial data sources based on the relational model. In our approach, application designers only need to be aware of a minimum amount about the queries needed to supply users with the required data. The key part of this research has been the development of the spatial mediator that can dynamically respond to requests within the GeoGrid environment for geographic maps and related relational spatial data

    The Origin of Data: Enabling the Determination of Provenance in Multi-institutional Scientific Systems through the Documentation of Processes

    Get PDF
    The Oxford English Dictionary defines provenance as (i) the fact of coming from some particular source or quarter; origin, derivation. (ii) the history or pedigree of a work of art, manuscript, rare book, etc.; concr., a record of the ultimate derivation and passage of an item through its various owners. In art, knowing the provenance of an artwork lends weight and authority to it while providing a context for curators and the public to understand and appreciate the work’s value. Without such a documented history, the work may be misunderstood, unappreciated, or undervalued. In computer systems, knowing the provenance of digital objects would provide them with greater weight, authority, and context just as it does for works of art. Specifically, if the provenance of digital objects could be determined, then users could understand how documents were produced, how simulation results were generated, and why decisions were made. Provenance is of particular importance in science, where experimental results are reused, reproduced, and verified. However, science is increasingly being done through large-scale collaborations that span multiple institutions, which makes the problem of determining the provenance of scientific results significantly harder. Current approaches to this problem are not designed specifically for multi-institutional scientific systems and their evolution towards greater dynamic and peer-to-peer topologies. Therefore, this thesis advocates a new approach, namely, that through the autonomous creation, scalable recording, and principled organisation of documentation of systems’ processes, the determination of the provenance of results produced by complex multi-institutional scientific systems is enabled. The dissertation makes four contributions to the state of the art. First is the idea that provenance is a query performed over documentation of a system’s past process. Thus, the problem is one of how to collect and collate documentation from multiple distributed sources and organise it in a manner that enables the provenance of a digital object to be determined. Second is an open, generic, shared, principled data model for documentation of processes, which enables its collation so that it provides high-quality evidence that a system’s processes occurred. Once documentation has been created, it is recorded into specialised repositories called provenance stores using a formally specified protocol, which ensures documentation has high-quality characteristics. Furthermore, patterns and techniques are given to permit the distributed deployment of provenance stores. The protocol and patterns are the third contribution. The fourth contribution is a characterisation of the use of documentation of process to answer questions related to the provenance of digital objects and the impact recording has on application performance. Specifically, in the context of a bioinformatics case study, it is shown that six different provenance use cases are answered given an overhead of 13% on experiment run-time. Beyond the case study, the solution has been applied to other applications including fault tolerance in service-oriented systems, aerospace engineering, and organ transplant management

    A semantic and agent-based approach to support information retrieval, interoperability and multi-lateral viewpoints for heterogeneous environmental databases

    Get PDF
    PhDData stored in individual autonomous databases often needs to be combined and interrelated. For example, in the Inland Water (IW) environment monitoring domain, the spatial and temporal variation of measurements of different water quality indicators stored in different databases are of interest. Data from multiple data sources is more complex to combine when there is a lack of metadata in a computation forin and when the syntax and semantics of the stored data models are heterogeneous. The main types of information retrieval (IR) requirements are query transparency and data harmonisation for data interoperability and support for multiple user views. A combined Semantic Web based and Agent based distributed system framework has been developed to support the above IR requirements. It has been implemented using the Jena ontology and JADE agent toolkits. The semantic part supports the interoperability of autonomous data sources by merging their intensional data, using a Global-As-View or GAV approach, into a global semantic model, represented in DAML+OIL and in OWL. This is used to mediate between different local database views. The agent part provides the semantic services to import, align and parse semantic metadata instances, to support data mediation and to reason about data mappings during alignment. The framework has applied to support information retrieval, interoperability and multi-lateral viewpoints for four European environmental agency databases. An extended GAV approach has been developed and applied to handle queries that can be reformulated over multiple user views of the stored data. This allows users to retrieve data in a conceptualisation that is better suited to them rather than to have to understand the entire detailed global view conceptualisation. User viewpoints are derived from the global ontology or existing viewpoints of it. This has the advantage that it reduces the number of potential conceptualisations and their associated mappings to be more computationally manageable. Whereas an ad hoc framework based upon conventional distributed programming language and a rule framework could be used to support user views and adaptation to user views, a more formal framework has the benefit in that it can support reasoning about the consistency, equivalence, containment and conflict resolution when traversing data models. A preliminary formulation of the formal model has been undertaken and is based upon extending a Datalog type algebra with hierarchical, attribute and instance value operators. These operators can be applied to support compositional mapping and consistency checking of data views. The multiple viewpoint system was implemented as a Java-based application consisting of two sub-systems, one for viewpoint adaptation and management, the other for query processing and query result adjustment
    corecore