102 research outputs found

    An object query language for multimedia federations

    The Fischlar system provides a large centralised repository of multimedia files. As expansion is difficult in centralised systems and as different user groups need to define their own schemas, the EGTV (Efficient Global Transactions for Video) project was established to examine how the distribution of this database could be managed. A federated database approach is advocated in which the global schema is designed top-down, while all multimedia and textual data is stored in object-oriented (O-O) and object-relational (O-R) compliant databases. This thesis investigates queries and updates on large multimedia collections organised in the database federation. The goal of this research is to provide a generic query language capable of interrogating global and local multimedia database schemas. Therefore, a new query language, EQL, is defined to facilitate the querying of object-oriented and object-relational database schemas in a database- and platform-independent manner, and it acts as a canonical language for database federations. A new canonical language was required because the existing query language standards (SQL:1999 and OQL) are generally incompatible and translation between them is not trivial. EQL is supported by a formally defined object algebra and specified semantics for query evaluation. The ability to capture and store metadata of multiple database schemas is essential when constructing and querying a federated schema. Therefore we also present a new platform-independent metamodel for specifying multimedia schemas stored in both object-oriented and object-relational databases. This metadata is later used for the construction of the global schema and during the evaluation of local and global queries. Another important feature of any federated system is the ability to unambiguously define database schemas. The schema definition language for an EGTV database federation must be capable of specifying both object-oriented and object-relational schemas in a database-independent format. As XML represents a standard for encoding and distributing data across various platforms, a language based upon XML has been developed as part of our research. The ODLx (Object Definition Language XML) language specifies a set of XML-based structures for defining complex database schemas capable of representing different multimedia types. The language is fully integrated with the EGTV metamodel, through which ODLx schemas can be mapped to O-O and O-R databases.
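    To give the flavour of the ODLx idea, a minimal sketch follows in Python: it assembles a toy XML class description for a multimedia type using only the standard library. The element names (Class, Attribute) and the multimedia type name (MPEG2Video) are assumptions made for illustration and are not taken from the actual ODLx grammar.

        # Illustrative sketch only: element and type names below are assumed,
        # not the real ODLx vocabulary defined in the thesis.
        import xml.etree.ElementTree as ET

        def video_class_schema() -> str:
            """Build a toy XML fragment describing a multimedia class, ODLx-style."""
            cls = ET.Element("Class", name="VideoClip")
            ET.SubElement(cls, "Attribute", name="title", type="string")
            ET.SubElement(cls, "Attribute", name="duration", type="integer")
            # A multimedia-typed attribute, the kind of construct ODLx is said to support.
            ET.SubElement(cls, "Attribute", name="content", type="MPEG2Video")
            return ET.tostring(cls, encoding="unicode")

        if __name__ == "__main__":
            print(video_class_schema())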

    BioWarehouse: a bioinformatics database warehouse toolkit

    BACKGROUND: This article addresses the problem of interoperation of heterogeneous bioinformatics databases. RESULTS: We introduce BioWarehouse, an open source toolkit for constructing bioinformatics database warehouses using the MySQL and Oracle relational database managers. BioWarehouse integrates its component databases into a common representational framework within a single database management system, thus enabling multi-database queries using the Structured Query Language (SQL) while also facilitating a variety of database integration tasks such as comparative analysis and data mining. BioWarehouse currently supports the integration of a pathway-centric set of databases including ENZYME, KEGG, and BioCyc, and in addition the UniProt, GenBank, NCBI Taxonomy, and CMR databases, and the Gene Ontology. Loader tools, written in the C and Java languages, parse and load these databases into a relational database schema. The loaders also apply a degree of semantic normalization to their respective source data, decreasing semantic heterogeneity. The schema supports the following bioinformatics datatypes: chemical compounds, biochemical reactions, metabolic pathways, proteins, genes, nucleic acid sequences, features on protein and nucleic-acid sequences, organisms, organism taxonomies, and controlled vocabularies. As an application example, we applied BioWarehouse to determine the fraction of biochemically characterized enzyme activities for which no sequences exist in the public sequence databases. The answer is that no sequence exists for 36% of enzyme activities for which EC numbers have been assigned. These gaps in sequence data significantly limit the accuracy of genome annotation and metabolic pathway prediction, and are a barrier for metabolic engineering. Complex queries of this type provide examples of the value of the data warehousing approach to bioinformatics research. CONCLUSION: BioWarehouse embodies significant progress on the database integration problem for bioinformatics.
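    The enzyme-coverage figure quoted above is an example of the kind of cross-source query that becomes a single SQL statement once the component databases share one warehouse schema. The sketch below reproduces that query shape against a toy in-memory SQLite database; the table and column names (enzyme, protein_sequence, ec_number) are assumptions for illustration and do not reproduce the actual BioWarehouse schema.

        # Toy anti-join: enzyme activities (EC numbers) with no sequence record.
        # Schema and data are invented for illustration only.
        import sqlite3

        conn = sqlite3.connect(":memory:")
        conn.executescript("""
            CREATE TABLE enzyme (ec_number TEXT PRIMARY KEY, name TEXT);
            CREATE TABLE protein_sequence (id INTEGER PRIMARY KEY, ec_number TEXT);
            INSERT INTO enzyme VALUES ('1.1.1.1', 'alcohol dehydrogenase'),
                                      ('9.9.9.9', 'made-up orphan activity');
            INSERT INTO protein_sequence VALUES (1, '1.1.1.1');
        """)

        # EC numbers that have been assigned but have no associated sequence.
        orphans = conn.execute("""
            SELECT e.ec_number, e.name
            FROM enzyme e
            LEFT JOIN protein_sequence s ON s.ec_number = e.ec_number
            WHERE s.id IS NULL
        """).fetchall()
        print(orphans)  # [('9.9.9.9', 'made-up orphan activity')]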

    Integrated data management for RODOS


    Architecture for integrating heterogeneous biological data repositories using ontologies

    Thesis (M. Eng.) -- Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2005. Includes bibliographical references (p. 86-89). High-throughput experiments generate vast quantities of biological information that are stored in autonomous data repositories distributed across the World Wide Web. There exists a need to integrate information from multiple data repositories for the purposes of data mining; however, current methods of integration require a significant amount of manual work that is often tedious and time consuming. The thesis proposes a flexible architecture that facilitates the automation of data integration from multiple heterogeneous biological data repositories using ontologies. The design uses ontologies to resolve the semantic conflicts that usually hinder schema integration and searching for information. The architecture implemented successfully demonstrates how ontologies facilitate the automation of data integration from multiple data repositories. Nevertheless, many optimizations to increase the performance of the system were realized during the implementation of various components in the architecture and are described in the thesis. By Howard H. Chou, M.Eng.
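    A minimal sketch of the ontology-mapping idea follows, assuming two hypothetical repositories whose field names are reconciled onto shared concepts before integration; the repository names, field names, and concept labels are all invented for exposition and are not taken from the thesis.

        # Toy semantic reconciliation: source-specific field names are renamed to
        # shared ontology concepts so records from different repositories line up.
        ONTOLOGY_MAP = {
            ("repo_a", "gene_symbol"): "Gene",
            ("repo_b", "gene_name"):   "Gene",
            ("repo_a", "prot_acc"):    "ProteinAccession",
            ("repo_b", "uniprot_id"):  "ProteinAccession",
        }

        def normalize(source: str, record: dict) -> dict:
            """Rename source-specific fields to ontology concepts before merging."""
            return {ONTOLOGY_MAP.get((source, k), k): v for k, v in record.items()}

        merged = [
            normalize("repo_a", {"gene_symbol": "TP53", "prot_acc": "P04637"}),
            normalize("repo_b", {"gene_name": "TP53", "uniprot_id": "P04637"}),
        ]
        print(merged)  # both records now share the keys 'Gene' and 'ProteinAccession'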

    A Scalable Blocking Framework for Multidatabase Privacy-preserving Record Linkage

    Today many application domains, such as national statistics, healthcare, business analytics, fraud detection, and national security, require data to be integrated from multiple databases. Record linkage (RL) is a process used in data integration which links multiple databases to identify matching records that belong to the same entity. RL enriches the usefulness of data by removing duplicates, errors, and inconsistencies, which improves the effectiveness of decision making in data analytics applications. Often, organisations are not willing or authorised to share the sensitive information in their databases with any other party due to privacy and confidentiality regulations. The linkage of databases of different organisations is an emerging research area known as privacy-preserving record linkage (PPRL). PPRL facilitates the linkage of databases by ensuring the privacy of the entities in these databases. In the multidatabase (MD) context, PPRL is significantly challenged by the intrinsic exponential growth in the number of potential record pair comparisons. Such linkage often requires significant time and computational resources to produce the resulting sets of matching records. Due to the increased risk of collusion, preserving the privacy of the data becomes more problematic as the number of parties involved in the linkage process increases. Blocking is commonly used to scale the linkage of large databases. The aim of blocking is to remove those record pairs that correspond to non-matches (i.e. refer to different entities). Many techniques have been proposed for RL and PPRL for blocking two databases. However, many of these techniques are not suitable for blocking multiple databases. This creates a need to develop blocking techniques for the multidatabase linkage context, as real-world applications increasingly require more than two databases. This thesis is the first to conduct extensive research on blocking for multidatabase privacy-preserving record linkage (MD-PPRL). We consider several research problems in blocking for MD-PPRL. First, we start with a broad review of the background literature on PPRL. This allows us to identify the main research gaps that need to be investigated in MD-PPRL. Second, we introduce a blocking framework for MD-PPRL which provides more flexibility and control to database owners in the block generation process. Third, we propose different techniques that are used in our framework for (1) blocking of multiple databases, (2) identifying blocks that need to be compared across subgroups of these databases, and (3) filtering redundant record pair comparisons by the efficient scheduling of block comparisons to improve the scalability of MD-PPRL. Each of these techniques covers an important aspect of blocking in real-world MD-PPRL applications. Finally, this thesis reports on an extensive evaluation of the combined application of these methods on real datasets, which illustrates that they outperform existing approaches in terms of scalability, accuracy, and privacy.
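    The role blocking plays above can be shown with a small generic sketch: records are grouped by a blocking key so that only within-block pairs become candidate comparisons. This is plain blocking for exposition only, not the framework proposed in the thesis, and a real MD-PPRL setting would block on privacy-preserving encodings rather than the plaintext fields used here.

        # Generic blocking sketch: the blocking key (surname initial + birth year)
        # is an assumed example, not a technique from the thesis.
        from collections import defaultdict
        from itertools import combinations

        def blocking_key(record: dict) -> str:
            return record["surname"][0].upper() + record["birth_year"]

        def candidate_pairs(records: list) -> list:
            blocks = defaultdict(list)
            for r in records:
                blocks[blocking_key(r)].append(r["id"])
            pairs = []
            for ids in blocks.values():
                pairs.extend(combinations(ids, 2))  # compare only within a block
            return pairs

        db = [
            {"id": "a1", "surname": "Smith", "birth_year": "1980"},
            {"id": "b7", "surname": "Smyth", "birth_year": "1980"},
            {"id": "c3", "surname": "Jones", "birth_year": "1975"},
        ]
        print(candidate_pairs(db))  # [('a1', 'b7')]: the Jones record is never compared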

    An Object-Oriented Heterogeneous Database Architecture

    Many data management environments face a critical need to integrate heterogeneous data: data that are stored in varying locations using various data management systems with diverse data formats and schemas. To address this problem, the database research community has developed the concept of a heterogeneous database system (HDB) that provides users with the illusion of a single unified database. However, HDBs rely on the implicit assumption that all data to be integrated into the HDB are stored in full-fledged database management systems (DBMSs). This assumption leaves environments that need to integrate non-DBMS data unserved by HDB systems. Furthermore, HDBs are complex software solutions that are not easily implementable by database developers wrestling with heterogeneous data. This thesis presents a new, easily implemented HDB architecture that is suitable for integrating non-DBMS data. The key to our architecture is using an object-oriented database management system (OODBMS) as an implementation tool. Rather than developing an HDB from scratch, we leverage the power and facilities of the underlying OODBMS to provide a query language, application programmer interface, interactive query interface, concurrency control, etc. Using object-oriented technology gives us an additional benefit: our HDB becomes an object-oriented HDB (OOHDB), providing users with greater data model expressivity along with a powerful behavioral component. The OOHDB architecture we present is independent of a particular OODBMS and can be implemented using a number of commercial OODBMSs for a variety of data management environments. We describe one implementation of our architecture using the GemStone OODBMS for accessing heterogeneous materials science data. This implementation demonstrates how easily the architecture can be implemented. We use this implementation to analyze the performance of the architecture and examine the effectiveness of strategies for enhancing performance. We conclude that for many environments with heterogeneous non-DBMS data, our OOHDB architecture provides a good solution that is easy to implement using commercial OODBMS technology.
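    The wrapper idea at the heart of the architecture can be sketched as follows: non-DBMS data (here a small CSV text) is surfaced as objects with behaviour. In the thesis this object layer is provided by a commercial OODBMS (GemStone); plain Python classes merely stand in for it here, and the materials-science fields are invented for illustration.

        # Toy object wrapper over non-DBMS data; not the thesis implementation.
        import csv
        import io

        CSV_SOURCE = "name,youngs_modulus_gpa\nsteel,200\naluminium,69\n"

        class Material:
            def __init__(self, name: str, youngs_modulus_gpa: float):
                self.name = name
                self.youngs_modulus = youngs_modulus_gpa

            def stiffer_than(self, other: "Material") -> bool:
                # Behavioural component: queries can call methods, not just read fields.
                return self.youngs_modulus > other.youngs_modulus

        def load_materials(text: str) -> list:
            """Wrap each CSV row as a Material object."""
            return [Material(r["name"], float(r["youngs_modulus_gpa"]))
                    for r in csv.DictReader(io.StringIO(text))]

        steel, aluminium = load_materials(CSV_SOURCE)
        print(steel.stiffer_than(aluminium))  # True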

    A procedure for mediation of queries to sources in disparate contexts

    Includes bibliographical references (p. 17-19). By S. Bressan ... [et al.]

    Efficient similarity-based operations for data integration

    Keywords: similarity-based operations, similarity join, similarity grouping, data integration. Magdeburg, Univ., Faculty of Computer Science, dissertation, 2004. By Eike Schalleh.

    Integration of Heterogeneous Databases: Discovery of Meta-Information and Maintenance of Schema-Restructuring Views

    In today's networked world, information is widely distributed across many independent databases in heterogeneous formats. Integrating such information is a difficult task and has been addressed by several projects. However, previous integration solutions, such as the EVE-Project, have several shortcomings. Database contents and structure change frequently, and users often have incomplete information about the data content and structure of the databases they use. When information from several such insufficiently described sources is to be extracted and integrated, two problems have to be solved: how can we discover the structure and contents of, and interrelationships among, unknown databases, and how can we provide durable integration views over several such databases? In this dissertation, we have developed solutions for those key problems in information integration. The first part of the dissertation addresses the fact that knowledge about the interrelationships between databases is essential for any attempt at solving the information integration problem. We present an algorithm called FIND2, based on the clique-finding problem in graphs and k-uniform hypergraphs, to discover redundancy relationships between two relations. Furthermore, the algorithm is enhanced by heuristics that significantly reduce the search space when necessary. Extensive experimental studies on the algorithm both with and without heuristics illustrate its effectiveness on a variety of real-world data sets. The second part of the dissertation addresses the durable view problem and presents the first algorithm for incremental view maintenance in schema-restructuring views. Such views are essential for the integration of heterogeneous databases. They are typically defined in schema-restructuring query languages like SchemaSQL, which can transform schema into data and vice versa, making traditional view maintenance based on differential queries impossible. Based on an existing algebra for SchemaSQL, we present an update propagation algorithm that propagates updates along the query algebra tree and prove its correctness. We also propose optimizations to our algorithm and present experimental results showing its benefits over view recomputation.
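    The clique-based discovery step can be illustrated with a small sketch that uses the third-party networkx library: attribute pairs that look mutually redundant form the edges of a graph, and maximal cliques suggest larger candidate redundancy sets to validate. This shows only the generic clique idea, not the FIND2 algorithm, its hypergraph extension, or its heuristics, and the attribute names are invented.

        # Generic maximal-clique illustration (requires networkx); attribute names
        # and the pairwise hints are invented, not output of FIND2.
        import networkx as nx

        # Edge (a, b): the values of attribute a and attribute b were observed to
        # overlap strongly, hinting at a pairwise redundancy between two relations.
        pairwise_hints = [
            ("R.cust_id", "S.customer_no"),
            ("R.cust_id", "S.account_key"),
            ("S.customer_no", "S.account_key"),
            ("R.zip", "S.postal_code"),
        ]

        g = nx.Graph()
        g.add_edges_from(pairwise_hints)

        # Each maximal clique is a candidate set of attributes that may describe
        # the same real-world property across the relations.
        for clique in nx.find_cliques(g):
            if len(clique) > 2:
                print(sorted(clique))  # ['R.cust_id', 'S.account_key', 'S.customer_no']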