51 research outputs found

    A Data Transformation System for Biological Data Sources

    Scientific data of importance to biologists in the Human Genome Project resides not only in conventional databases, but in structured files maintained in a number of different formats (e.g. ASN.1 and ACE) as well as sequence analysis packages (e.g. BLAST and FASTA). These formats and packages contain a number of data types not found in conventional databases, such as lists and variants, and may be deeply nested. We present in this paper techniques for querying and transforming such data, and illustrate their use in a prototype system developed in conjunction with the Human Genome Center for Chromosome 22. We also describe optimizations performed by the system, a crucial issue for bulk data

    GFINDer: Genome Function INtegrated Discoverer through dynamic annotation, statistical analysis, and mining

    Statistical and clustering analyses of gene expression results from high-density microarray experiments produce lists of hundreds of genes regulated differentially, or with particular expression profiles, in the conditions under study. Independent of the microarray platforms and analysis methods used, these lists must be biologically interpreted to gain a better knowledge of the patho-physiological phenomena involved. To this end, numerous biological annotations are available within heterogeneous and widely distributed databases. Although several tools have been developed for annotating lists of genes, most of them do not give methods for evaluating the relevance of the annotations provided, or for estimating the functional bias introduced by the gene set on the array used to identify the gene list considered. We developed Genome Functional INtegrated Discoverer (GFINDer), a web server able to automatically provide large-scale lists of user-classified genes with functional profiles biologically characterizing the different gene classes in the list. GFINDer automatically retrieves annotations of several functional categories from different sources, identifies the categories enriched in each class of a user-classified gene list and calculates statistical significance values for each category. Moreover, GFINDer enables the functional classification of genes according to mined functional categories and the statistical analysis of the classifications obtained, aiding better interpretation of microarray experiment results. GFINDer is available online at http://www.medinfopoli.polimi.it/GFINDer/
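
    As a rough illustration of what a per-category significance value can look like, the sketch below computes a one-sided hypergeometric p-value for category enrichment. The abstract does not state which test GFINDer applies, so the choice of test is an assumption; the function name and the example counts are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(population: int, annotated: int, sample: int, hits: int) -> float:
    """One-sided enrichment p-value, P(X >= hits).

    population: number of genes on the array (the background set)
    annotated:  number of background genes carrying the category
    sample:     number of genes in the user-defined class
    hits:       number of class genes carrying the category

    Assumption: genes are drawn without replacement, so X follows a
    hypergeometric distribution. This is an illustrative choice, not
    necessarily the test used by GFINDer.
    """
    max_hits = min(annotated, sample)
    total = comb(population, sample)
    # Sum the upper tail of the hypergeometric probability mass function.
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, max_hits + 1)
    )
    return tail / total


# Hypothetical counts: 1000 genes on the array, 50 annotated with the
# category, 100 genes in the class, 15 of which carry the annotation.
p_value = hypergeom_enrichment_p(1000, 50, 100, 15)
print(f"enrichment p-value: {p_value:.3g}")
```

    Using the whole array as the background, rather than the whole genome, is one way to account for the functional bias introduced by the gene set on the array that the abstract mentions.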

    A cooperative framework for molecular biology database integration using image object selection

    The theme and concept of 'Molecular Biology Database Integration' and the problems associated with it initiated the idea for this Ph.D. research. The available technologies make it possible to analyse the data independently and discretely, but they fail to integrate the data resources to yield more meaningful information. This, along with the integration issues, created the scope for this research. The research reviewed the 'database interoperability' problems and suggested a framework for integrating molecular biology databases. The framework proposes a cooperative environment in which molecular biology databases share information on the basis of a common purpose. The research also reviewed other implementation and interoperability issues for laboratory-based, dedicated and target-specific databases. The research addressed the following issues: diversity of molecular biology database schemas, schema constructs and schema implementation; multi-database query using image object keying; database integration technologies using a context graph; and automated navigation among these databases. This thesis introduces a new approach to database implementation. It introduces an interoperable component database concept to initiate multidatabase queries on gene mutation data. A number of data models have been proposed for gene mutation data, which form the basis for integrating the target-specific component database with the federated information system. The proposed data models are: data models for genetic trait analysis, classification of gene mutation data, pathological lesion data and laboratory data. The main feature of this component database is non-overlapping attributes, and it follows a non-redundant integration approach as explained in the thesis. 
This is achieved by storing attributes that have no union or intersection with any attributes that exist in public domain molecular biology databases. Unlike the data warehousing technique, this feature is novel. The component database will be integrated with other biological data sources for sharing information in a cooperative environment. This involves developing new tools. The thesis explains the role of these new tools, which are: meta data extractor, mapping linker, query generator and result interpreter. These tools are used for transparent integration without creating any global schema of the participating databases. The thesis also establishes the concept of image object keying for multidatabase query and proposes a relevant algorithm for matching protein spots in gel electrophoresis images. An object spot in a gel electrophoresis image initiates the query when it is selected by the user. The algorithm matches the selected spot with similar spots in other resource databases. This image object keying method is an alternative to conventional multidatabase query, which requires writing complex SQL scripts. This method also resolves the semantic conflicts that exist among molecular biology databases. The research proposes a new framework based on the context of the web data for interactions with different biological data resources. A formal description of the resource context is given in the thesis. Implementing the context in the Resource Description Framework (RDF) increases interoperability by providing the description of the resources and the navigation plan for accessing the web-based databases. A higher-level construct (has, provide and access) is developed to implement the context in RDF for web interactions. The interactions within the resources are achieved by utilising an integration domain to extract the required information with a single instance and without writing any query scripts. 
The integration domain allows navigation and execution of the query plan within the resource databases. An extractor module collects elements from different target webs and unifies them as a whole object in a single page. The proposed framework is tested by finding specific information, e.g. information on Alzheimer's disease, from public domain biology resources such as Protein Data Bank, Genome Data Bank, Online Mendelian Inheritance in Man and a local database. Finally, the thesis sets out further propositions and plans for future work

    A transcriptional sketch of a primary human breast cancer by 454 deep sequencing

    Background: The cancer transcriptome is difficult to explore due to the heterogeneity of quantitative and qualitative changes in gene expression linked to the disease status. An increasing number of "unconventional" transcripts, such as novel isoforms, non-coding RNAs, somatic gene fusions and deletions have been associated with the tumoral state. Massively parallel sequencing techniques provide a framework for exploring the transcriptional complexity inherent to cancer with a limited laboratory and financial effort. We developed a deep sequencing and bioinformatics analysis protocol to investigate the molecular composition of a breast cancer poly(A)+ transcriptome. This method utilizes a cDNA library normalization step to diminish the representation of highly expressed transcripts and biology-oriented bioinformatic analyses to facilitate detection of rare and novel transcripts. Results: We analyzed over 132,000 Roche 454 high-confidence deep sequencing reads from a primary human lobular breast cancer tissue specimen, and detected a range of unusual transcriptional events that were subsequently validated by RT-PCR in eight additional primary human breast cancer samples. We identified and validated one deletion, two novel ncRNAs (one intergenic and one intragenic), ten previously unknown or rare transcript isoforms and a novel gene fusion specific to a single primary tissue sample. We also explored the non-protein-coding portion of the breast cancer transcriptome, identifying thousands of novel non-coding transcripts and more than three hundred reads corresponding to the non-coding RNA MALAT1, which is highly expressed in many human carcinomas. 
Conclusion: Our results demonstrate that combining 454 deep sequencing with a normalization step and careful bioinformatic analysis facilitates the discovery and quantification of rare transcripts or ncRNAs, and can be used as a qualitative tool to characterize transcriptome complexity, revealing many hitherto unknown transcripts, splice isoforms, gene fusion events and ncRNAs, even at a relatively low sequence sampling

    GenMAPP 2: New features and resources for pathway analysis

    BACKGROUND: Microarray technologies have evolved rapidly, enabling biologists to quantify genome-wide levels of gene expression, alternative splicing, and sequence variations for a variety of species. Analyzing and displaying these data present a significant challenge. Pathway-based approaches for analyzing microarray data have proven useful for presenting data and for generating testable hypotheses. RESULTS: To address the growing needs of the microarray community we have released version 2 of Gene Map Annotator and Pathway Profiler (GenMAPP), a new GenMAPP database schema, and integrated resources for pathway analysis. We have redesigned the GenMAPP database to support multiple gene annotations and species as well as custom species database creation for a potentially unlimited number of species. We have expanded our pathway resources by utilizing homology information to translate pathway content between species and extending existing pathways with data derived from conserved protein interactions and coexpression. We have implemented a new mode of data visualization to support analysis of complex data, including time-course, single nucleotide polymorphism (SNP), and splicing. GenMAPP version 2 also offers innovative ways to display and share data by incorporating HTML export of analyses for entire sets of pathways as organized web pages. CONCLUSION: GenMAPP version 2 provides a means to rapidly interrogate complex experimental data for pathway-level changes in a diverse range of organisms

    The Bermuda Triangle: the pragmatics, policies, and principles for data sharing in the history of the Human Genome Project

    © The Author(s), 2018. This article is distributed under the terms of the Creative Commons Attribution License. The definitive version was published in Journal of the History of Biology 51 (2018): 693–805, doi:10.1007/s10739-018-9538-7. The Bermuda Principles for DNA sequence data sharing are an enduring legacy of the Human Genome Project (HGP). They were adopted by the HGP at a strategy meeting in Bermuda in February of 1996 and implemented in formal policies by early 1998, mandating daily release of HGP-funded DNA sequences into the public domain. The idea of daily sharing, we argue, emanated directly from strategies for large, goal-directed molecular biology projects first tested within the “community” of C. elegans researchers, and were introduced and defended for the HGP by the nematode biologists John Sulston and Robert Waterston. In the C. elegans community, and subsequently in the HGP, daily sharing served the pragmatic goals of quality control and project coordination. Yet in the HGP human genome, we also argue, the Bermuda Principles addressed concerns about gene patents impeding scientific advancement, and were aspirational and flexible in implementation and justification. They endured as an archetype for how rapid data sharing could be realized and rationalized, and permitted adaptation to the needs of various scientific communities. Yet in addition to the support of Sulston and Waterston, their adoption also depended on the clout of administrators at the US National Institutes of Health (NIH) and the UK nonprofit charity the Wellcome Trust, which together funded 90% of the HGP human sequencing effort. The other nations wishing to remain in the HGP consortium had to accommodate to the Bermuda Principles, requiring exceptions from incompatible existing or pending data access policies for publicly funded research in Germany, Japan, and France. 
We begin this story in 1963, with the biologist Sydney Brenner’s proposal for a nematode research program at the Laboratory of Molecular Biology (LMB) at the University of Cambridge. We continue through 2003, with the completion of the HGP human reference genome, and conclude with observations about policy and the historiography of molecular biology