151 research outputs found

    Multidatabase concurrency control


    Rule-based knowledge aggregation for large-scale protein sequence analysis of influenza A viruses

    Background: The explosive growth of biological data provides opportunities for new statistical and comparative analyses of large information sets, such as alignments comprising tens of thousands of sequences. In such studies, sequence annotations frequently play an essential role, and reliable results depend on metadata quality. However, the semantic heterogeneity and annotation inconsistencies in biological databases greatly increase the complexity of aggregating and cleaning metadata. Manual curation of datasets, traditionally favoured by life scientists, is impractical for studies involving thousands of records. In this study, we investigate quality issues that affect major public databases and quantify the effectiveness of an automated metadata extraction approach that combines structural and semantic rules. We applied this approach to more than 90,000 influenza A records, annotating sequences with protein name, virus subtype, isolate, host, geographic origin, and year of isolation.
    Results: Over 40,000 annotated influenza A protein sequences were collected by combining information from more than 90,000 documents from NCBI public databases. Metadata values were automatically extracted, aggregated, and reconciled from several document fields by applying user-defined structural rules. For each property, values were recovered from ≥88.8% of records, with accuracy exceeding 96% in most cases. Because of semantic heterogeneity, up to six different structural rules had to be combined for each property. Significant quality differences between databases were found: GenBank documents yield values more reliably than documents extracted from GenPept. Using a simple set of semantic rules and a reasoner, we reconstructed relationships between sequences from the same isolate, thus identifying 7,640 isolates. Validating isolate metadata against a simple ontology highlighted more than 400 inconsistencies, leading to over 3,000 property value corrections.
    Conclusion: To overcome the quality issues inherent in public databases, automated knowledge aggregation with embedded intelligence is needed for large-scale analyses. Our results show that intuitive, user-controlled approaches based on combining simple rules can reliably automate various curation tasks, reducing the need for manual corrections to approximately 5% of the records. Emerging semantic technologies possess desirable features to support today's knowledge aggregation tasks, with the potential to bring immediate benefits to this field.
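    As a concrete illustration of what one "structural rule" can look like, the minimal sketch below parses the conventional influenza A strain-name pattern (A/[host/]location/isolate/year(subtype)) out of a GenBank DEFINITION line. The regular expression, field names, and human-host fallback are illustrative assumptions, not the paper's actual rule set; as the abstract notes, real records need several such rules combined.

```python
import re

# Illustrative structural rule: parse an influenza A strain name of the
# conventional form A/[host/]location/isolate/year(subtype), e.g.
# "Influenza A virus (A/swine/Iowa/15/1930(H1N1))". This is a sketch,
# not the paper's rule set.
STRAIN_RULE = re.compile(
    r"A/"
    r"(?:(?P<host>[A-Za-z ]+)/)?"    # host segment is omitted for human isolates
    r"(?P<location>[A-Za-z .'-]+)/"
    r"(?P<isolate>[\w-]+)/"
    r"(?P<year>\d{2,4})"
    r"(?:\s*\((?P<subtype>H\d+N\d+)\))?"
)

def extract_metadata(definition_line: str) -> dict:
    """Apply the structural rule to one GenBank DEFINITION line."""
    m = STRAIN_RULE.search(definition_line)
    if not m:
        return {}
    meta = {k: v for k, v in m.groupdict().items() if v}
    meta.setdefault("host", "human")  # convention: missing host segment means human
    return meta

if __name__ == "__main__":
    print(extract_metadata(
        "Influenza A virus (A/swine/Iowa/15/1930(H1N1)) hemagglutinin gene"
    ))
    # {'host': 'swine', 'location': 'Iowa', 'isolate': '15',
    #  'year': '1930', 'subtype': 'H1N1'}
```

    A single rule like this inevitably misparses some records (for instance, multi-word locations can be confused with hosts), which is exactly why the paper combines several rules per property and reconciles the extracted values.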

    Integrated data management for RODOS


    A cooperative framework for molecular biology database integration using image object selection

    The theme of 'Molecular Biology Database Integration' and the problems associated with it motivated this Ph.D. research. The available technologies make it possible to analyse data independently and discretely, but they fail to integrate the data resources into more meaningful information. This, together with the integration issues, defined the scope of the research. The research reviews 'database interoperability' problems and suggests a framework for integrating molecular biology databases. The framework proposes a cooperative environment in which molecular biology databases share information on the basis of common purpose. The research also reviews implementation and interoperability issues for laboratory-based, dedicated, and target-specific databases. It addresses the following issues:
    - diversity of molecular biology database schemas, schema constructs, and schema implementations
    - multi-database query using image object keying
    - database integration technologies using a context graph
    - automated navigation among these databases
    The thesis introduces a new approach to database implementation: an interoperable component database concept that initiates multidatabase queries on gene mutation data. A number of data models are proposed for gene mutation data, forming the basis for integrating the target-specific component database with the federated information system. The proposed data models cover genetic trait analysis, classification of gene mutation data, pathological lesion data, and laboratory data. The main feature of this component database is its non-overlapping attributes, following the non-redundant integration approach explained in the thesis: it stores only attributes that do not overlap with any attributes existing in public domain molecular biology databases. Unlike the data warehousing technique, this feature is unique and novel. The component database is integrated with other biological data sources to share information in a cooperative environment. This involves developing new tools: a metadata extractor, a mapping linker, a query generator, and a result interpreter, which together achieve transparent integration without creating any global schema of the participating databases. The thesis also establishes the concept of image object keying for multidatabase query and proposes an algorithm for matching protein spots in gel electrophoresis images. A spot in a gel electrophoresis image initiates the query when it is selected by the user; the algorithm then matches the selected spot with similar spots in the other resource databases. This image object keying method is an alternative to conventional multidatabase query, which requires writing complex SQL scripts, and it also resolves the semantic conflicts that exist among molecular biology databases, as the sketch below illustrates.
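    The image-object-keying idea can be made concrete with a small sketch. Assuming each database exposes its detected spots as normalised (x, y, intensity) tuples, selecting a spot triggers a nearest-neighbour match against each remote spot list; the weighted distance and cutoff below are illustrative stand-ins for the thesis's actual matching algorithm.

```python
import math

# Minimal sketch of image-object keying for 2D gel electrophoresis images,
# assuming each database exposes its spots as (x, y, intensity) tuples with
# x/y normalised to [0, 1]. The matching criterion is illustrative.
Spot = tuple[float, float, float]

def spot_distance(a: Spot, b: Spot, w_pos: float = 1.0, w_int: float = 0.25) -> float:
    """Weighted distance: position dominates, intensity breaks ties."""
    d_pos = math.hypot(a[0] - b[0], a[1] - b[1])
    d_int = abs(a[2] - b[2])
    return w_pos * d_pos + w_int * d_int

def match_spot(selected: Spot, candidates: list[Spot], cutoff: float = 0.05) -> Spot | None:
    """Return the closest candidate spot within the cutoff, or None."""
    best = min(candidates, key=lambda c: spot_distance(selected, c), default=None)
    if best is not None and spot_distance(selected, best) <= cutoff:
        return best
    return None

# Selecting a spot on the local gel image runs the same match against each
# remote database's spot list, replacing a hand-written multidatabase query.
local_spot = (0.42, 0.61, 0.80)
remote_spots = [(0.10, 0.20, 0.5), (0.43, 0.60, 0.75), (0.90, 0.90, 0.3)]
print(match_spot(local_spot, remote_spots))  # (0.43, 0.6, 0.75)
```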
    The research further proposes a framework based on the context of web data for interacting with different biological data resources; a formal description of the resource context is given in the thesis. Expressing the context in the Resource Description Framework (RDF) increases interoperability by providing descriptions of the resources and a navigation plan for accessing the web-based databases. A higher-level construct (has, provide, and access) is developed to implement the context in RDF for web interactions. Interactions within the resources are achieved through an integration domain that extracts the required information in a single instance without writing any query scripts; the integration domain navigates and executes the query plan within the resource databases. An extractor module collects elements from different target web resources and unifies them as a whole object in a single page. The proposed framework is tested by retrieving specific information, e.g., on Alzheimer's disease, from public domain biology resources such as the Protein Data Bank, the Genome Data Bank, Online Mendelian Inheritance in Man, and a local database. Finally, the thesis offers further propositions and plans for future work. (EThOS - Electronic Theses Online Service, United Kingdom)
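    A minimal sketch of how the has/provide/access constructs might be expressed in RDF, here using the rdflib library and a made-up namespace; the thesis's concrete vocabulary, and the literal descriptions used below, are assumptions for illustration only.

```python
from rdflib import Graph, Literal, Namespace, RDF, URIRef

# Hypothetical context vocabulary; the thesis's actual terms may differ.
CTX = Namespace("http://example.org/biocontext#")

g = Graph()
g.bind("ctx", CTX)

pdb = URIRef("https://www.rcsb.org/")  # Protein Data Bank
g.add((pdb, RDF.type, CTX.ResourceDatabase))
g.add((pdb, CTX.has, Literal("protein structure data")))           # what it holds
g.add((pdb, CTX.provide, Literal("keyword search over entries")))  # what it offers
g.add((pdb, CTX.access, Literal("HTTP GET, results as HTML/XML"))) # how to reach it

print(g.serialize(format="turtle"))
```

    A navigation plan can then be derived by querying such a graph (e.g., with SPARQL) for the resources that "have" the information a given request needs.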

    Common Problems, Common Data Model Solutions: Evidence Generation for Health Technology Assessment

    There is growing interest in using observational data to assess the safety, effectiveness, and cost effectiveness of medical technologies, but operational, technical, and methodological challenges limit its more widespread use. Common data models and federated data networks offer a potential solution to many of these problems. The open-source Observational Medical Outcomes Partnership (OMOP) common data model standardises the structure, format, and terminologies of otherwise disparate datasets, enabling the execution of common analytical code across a federated data network in which only code and aggregate results are shared. While common data models are increasingly used in regulatory decision making, relatively little attention has been given to their use in health technology assessment (HTA). We show that the common data model has the potential to facilitate access to relevant data, enable multidatabase studies to enhance statistical power and transfer results across populations and settings to meet the needs of local HTA decision makers, and validate findings. The use of open-source and standardised analytics improves transparency and reduces coding errors, thereby increasing confidence in the results. Further engagement from the HTA community is required to inform the appropriate standards for mapping data to the common data model and to design tools that can support evidence generation and decision making.
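    To make the mechanism concrete, the sketch below maps a hypothetical local patient record into an OMOP-style PERSON row. The source field names are invented for illustration; 8507 and 8532 are the OMOP standard concept IDs for male and female gender, but any real mapping should be checked against the current OMOP vocabulary.

```python
from dataclasses import dataclass

# Illustrative gender mapping; verify concept IDs against the OMOP vocabulary.
GENDER_CONCEPTS = {"M": 8507, "F": 8532}

@dataclass
class Person:
    """A few fields of the OMOP PERSON table, for illustration."""
    person_id: int
    gender_concept_id: int
    year_of_birth: int

def to_omop_person(source: dict) -> Person:
    """Translate one local patient record (hypothetical field names)
    into the shared shape of the common data model."""
    return Person(
        person_id=source["patient_id"],
        gender_concept_id=GENDER_CONCEPTS.get(source["sex"], 0),  # 0 = no matching concept
        year_of_birth=int(source["dob"][:4]),
    )

# Once every site holds its data in this shared shape, identical analytic
# code can run at each site, and only aggregate results leave the network.
print(to_omop_person({"patient_id": 1, "sex": "F", "dob": "1957-03-14"}))
```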
