292 research outputs found

    Extending pathways based on gene lists using InterPro domain signatures

    Background: High-throughput technologies like functional screens and gene expression analysis produce extended lists of candidate genes. Gene-Set Enrichment Analysis is a commonly used and well-established technique to test for the statistically significant over-representation of particular pathways. A shortcoming of this method, however, is that most genes investigated in such experiments have very sparse functional or pathway annotation and therefore cannot be the target of such an analysis. The approach presented here aims to assign lists of genes with limited annotation to previously described functional gene collections or pathways. This works by comparing InterPro domain signatures of the candidate gene lists with domain signatures of gene sets derived from known classifications, e.g. KEGG pathways. Results: In order to validate our approach, we designed a simulation study. Based on all pathways available in the KEGG database, we create test gene lists by randomly selecting pathway genes, removing these genes from the known pathways and adding variable amounts of noise in the form of genes not annotated to the pathway. We show that we can recover pathway memberships based on the simulated gene lists with high accuracy. We further demonstrate the applicability of our approach on a biological example. Conclusion: Results based on simulation and data analysis show that domain-based pathway enrichment analysis is a very sensitive method to test for enrichment of pathways in sparsely annotated lists of genes. An R-based software package, domainsignatures, to routinely perform this analysis on the results of high-throughput screening is available via Bioconductor.
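
The comparison of domain signatures described in the abstract above can be illustrated with a minimal Python sketch. It is not the algorithm of the domainsignatures package: the Jaccard similarity, the resampling scheme, the function names, and the toy gene and domain identifiers (`g1`, `IPR_A`, and so on) are all assumptions made for illustration.

```python
import random

# Hypothetical toy annotation: gene -> set of domain identifiers.
TOY_DOMAINS = {
    "g1": {"IPR_A", "IPR_B"},
    "g2": {"IPR_B"},
    "g3": {"IPR_C"},
    "g4": {"IPR_D"},
    "g5": {"IPR_A"},
    "g6": {"IPR_E"},
}


def domain_signature(genes, gene_domains):
    """Union of all domain identifiers annotated to the given genes."""
    signature = set()
    for gene in genes:
        signature |= gene_domains.get(gene, set())
    return signature


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def signature_p_value(gene_list, pathway_genes, gene_domains, n_perm=1000, seed=0):
    """Observed similarity plus an empirical p-value.

    The p-value estimates how often a random gene list of the same size
    matches the pathway's domain signature at least as well as gene_list.
    """
    rng = random.Random(seed)
    universe = sorted(gene_domains)
    pathway_sig = domain_signature(pathway_genes, gene_domains)
    observed = jaccard(domain_signature(gene_list, gene_domains), pathway_sig)
    hits = 0
    for _ in range(n_perm):
        sample = rng.sample(universe, len(gene_list))
        if jaccard(domain_signature(sample, gene_domains), pathway_sig) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return observed, (hits + 1) / (n_perm + 1)


if __name__ == "__main__":
    similarity, p = signature_p_value(["g5", "g3"], ["g1", "g2"], TOY_DOMAINS)
    print(f"similarity={similarity:.3f}, empirical p={p:.3f}")
```

The point of the sketch is that the candidate genes need no pathway annotation of their own; only their domain content is compared with that of the pathway.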

    Predicting pathway membership via domain signatures

    Motivation: Functional characterization of genes is of great importance for the understanding of complex cellular processes. Valuable information for this purpose can be obtained from pathway databases, like KEGG. However, only a small fraction of genes is annotated with pathway information up to now. In contrast, information on contained protein domains can be obtained for a significantly higher number of genes, e.g. from the InterPro database

    PhenoFam-gene set enrichment analysis through protein structural information

    Background: With the current technological advances in high-throughput biology, the need for tools that help to analyse the massive amount of data being generated is evident. A powerful method of inspecting large-scale data sets is gene set enrichment analysis (GSEA), and investigating protein structural features can help to determine the function of individual genes. However, a convenient tool that combines these two features to aid in high-throughput data analysis has not been developed yet. In order to fill this niche, we developed the user-friendly, web-based application PhenoFam. Results: PhenoFam performs gene set enrichment analysis by employing structural and functional information on families of protein domains as annotation terms. Our tool is designed to analyse complete sets of results from quantitative high-throughput studies (gene expression microarrays, functional RNAi screens, etc.) without prior pre-filtering or hit-selection steps. PhenoFam utilizes Ensembl databases to link a list of user-provided identifiers with protein features from the InterPro database, and assesses whether results associated with individual domains differ significantly from the overall population. To demonstrate the utility of PhenoFam we analysed a genome-wide RNA interference screen and discovered a novel function of plexins containing the cytoplasmic RasGAP domain. Furthermore, a PhenoFam analysis of breast cancer gene expression profiles revealed a link between breast carcinoma and altered expression of PX domain containing proteins. Conclusions: PhenoFam provides a user-friendly, easily accessible web interface to perform GSEA based on high-throughput data sets and structural-functional protein information, and therefore aids in functional annotation of genes.
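
The abstract above does not name the statistical test PhenoFam applies. As a hedged illustration of the idea (do the scores of genes carrying one domain differ from those of the remaining genes?), the following Python sketch uses a Mann-Whitney rank-sum statistic with a normal approximation and no tie correction. The choice of test, the function names, and the toy scores are assumptions, not PhenoFam's implementation.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def rank_sum_z(scores, members):
    """z-score of the Mann-Whitney U statistic: genes in `members` vs. the rest."""
    genes = list(scores)
    ranks = average_ranks([scores[g] for g in genes])
    n1 = sum(1 for g in genes if g in members)
    n2 = len(genes) - n1
    if n1 == 0 or n2 == 0:
        raise ValueError("both groups must be non-empty")
    r1 = sum(r for g, r in zip(genes, ranks) if g in members)
    u1 = r1 - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return (u1 - mean_u) / sd_u


def two_sided_p(z):
    """Two-sided p-value under the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2))


if __name__ == "__main__":
    # Hypothetical screen scores; genes "e" and "f" carry the domain of interest.
    toy_scores = {"a": 1.0, "b": 2.0, "c": 3.0, "d": 4.0, "e": 5.0, "f": 6.0}
    z = rank_sum_z(toy_scores, {"e", "f"})
    print(f"z={z:.3f}, p={two_sided_p(z):.3f}")
```

A rank-based statistic suits the setting the abstract describes, since it uses the complete list of results and needs no prior hit selection.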

    Identification of differentially expressed genes from multipotent epithelia at the onset of an asexual development

    © The Author(s), 2016. This article is distributed under the terms of the Creative Commons Attribution License. The definitive version was published in Scientific Reports 6 (2016): 27357, doi:10.1038/srep27357. Organisms that have evolved alternative modes of reproduction, complementary to the sexual mode, are found across metazoans. The chordate Botryllus schlosseri is an emerging model for asexual development studies. Botryllus can rebuild its entire body from a portion of adult epithelia in a continuous and stereotyped process called blastogenesis. The anatomy and ontogeny of blastogenesis are well described; however, the molecular signatures triggering this developmental process are entirely unknown. We isolated tissues at the site of blastogenesis onset and from the same epithelia where this process is never triggered. We linearly amplified an ultra-low amount of mRNA (<10ng) and generated three transcriptome datasets. To provide a conservative landscape of transcripts differentially expressed between blastogenic vs. non-blastogenic epithelia, we compared three different mapping and analysis strategies with a de novo assembled transcriptome and partially assembled genome as references, as well as a self-mapping strategy on the dataset. A subset of differentially expressed genes was analyzed and validated by in situ hybridization. The comparison of different analyses allowed us to isolate stringent sets of target genes, including transcripts with potential involvement in the onset of a non-embryonic developmental pathway. The results provide a good entry point to approach regenerative events in a basal chordate. This work was supported by AFM Telethon grant (#16611), IRG Marie Curie grant (#276974), ANR (ANR-14-CE02-0019-01) and IDEX Super (INDIBIO). L.R. was supported by an UPMC-EMREGENCE grant and by a FRM grant (#FDT20140931163). A.C. was supported by a FRM grant (ING 20140129231).

    ReprOlive: a database with linked data for the olive tree (Olea europaea L.) reproductive transcriptome

    Plant reproductive transcriptomes have been analyzed in different species due to the agronomical and biotechnological importance of plant reproduction. Here we present an olive tree reproductive transcriptome database with samples from pollen and pistil at different developmental stages, and leaf and root as control vegetative tissues (http://reprolive.eez.csic.es). It was developed from 2,077,309 raw reads to 1,549 Sanger sequences. Using a pre-defined workflow based on open-source tools, sequences were pre-processed, assembled, mapped, and annotated with expression data, descriptions, GO terms, InterPro signatures, EC numbers, KEGG pathways, ORFs, and SSRs. Tentative transcripts (TTs) were also annotated with the corresponding orthologs in Arabidopsis thaliana from TAIR and RefSeq databases to enable Linked Data integration. The result is a reproductive transcriptome comprising 72,846 contigs with an average length of 686 bp, of which 63,965 (87.8%) included at least one functional annotation, and 55,356 (75.9%) had an ortholog. A minimum of 23,568 different TTs was identified and 5,835 of them contain a complete ORF. The representative reproductive transcriptome can be reduced to 28,972 TTs for further gene expression studies. Partial transcriptomes from pollen, pistil, and vegetative tissues as control were also constructed. ReprOlive provides free access and download capability to these results. Retrieval mechanisms for sequences and transcript annotations are provided. Graphical localization of annotated enzymes into KEGG pathways is also possible. Finally, ReprOlive has included a semantic conceptualisation by means of a Resource Description Framework (RDF) allowing a Linked Data search for extracting the most updated information related to enzymes, interactions, allergens, structures, and reactive oxygen species. This work was supported by co-funding from the ERDF (European Regional Development Fund) and (i) Ministerio de Ciencia e Innovación [grant numbers BFU2011-22779 to JDA, and TIN2011-25840 and TIN2014-58304-R to JFAM], and (ii) Plan Andaluz de Investigación, Desarrollo e Innovación [grant numbers P10-CVI-6075 to MGC, P10-AGR-6274 and P11-CVI-7487 to JDA, and P11-TIC-7529 and P12-TIC-1519 to JFAM]. Peer reviewed

    BIOZON: a system for unification, management and analysis of heterogeneous biological data

    BACKGROUND: Integration of heterogeneous data types is a challenging problem, especially in biology, where the number of databases and data types increases rapidly. Amongst the problems that one has to face are integrity, consistency, redundancy, connectivity, expressiveness and updatability. DESCRIPTION: Here we present a system (Biozon) that addresses these problems, and offers biologists a new knowledge resource to navigate through and explore. Biozon unifies multiple biological databases consisting of a variety of data types (such as DNA sequences, proteins, interactions and cellular pathways). It is fundamentally different from previous efforts as it uses a single extensive and tightly connected graph schema wrapped with a hierarchical ontology of documents and relations. Beyond warehousing existing data, Biozon computes and stores novel derived data, such as similarity relationships and functional predictions. The integration of similarity data allows propagation of knowledge through inference and fuzzy searches. Sophisticated query methods that span multiple data types were implemented and first-of-a-kind biological ranking systems were explored and integrated. CONCLUSION: The Biozon system is an extensive knowledge resource of heterogeneous biological data. Currently, it holds more than 100 million biological documents and 6.5 billion relations between them. The database is accessible through an advanced web interface that supports complex queries, "fuzzy" searches, data materialization and more, online at

    Integration and analysis of phenotypic data from functional screens

    Motivation: Although various high-throughput technologies provide a lot of valuable information, each of them gives insight into different aspects of cellular activity and each has its own limitations. Thus, a complete and systematic understanding of the cellular machinery can be achieved only by a combined analysis of results coming from different approaches. However, methods and tools for integration and analysis of heterogeneous biological data still have to be developed. Results: This work presents a systemic analysis of basic cellular processes, i.e. cell viability and cell cycle, as well as embryonic stem cell pluripotency and differentiation. These phenomena were studied using several high-throughput technologies, whose combined results were analysed with existing and novel clustering and hit selection algorithms. This thesis also introduces two novel data management and data analysis tools. The first, called DSViewer, is a database application designed for integrating and querying results coming from various genome-wide experiments. The second, named PhenoFam, is an application performing gene set enrichment analysis by employing structural and functional information on families of protein domains as annotation terms. Both programs are accessible through a web interface. Conclusions: The investigations presented in this work provide the research community with a novel and markedly improved repertoire of computational tools and methods that help to turn the accumulated information obtained from high-throughput studies into novel biological insights.

    AGeNNT: annotation of enzyme families by means of refined neighborhood networks

    Background: Large enzyme families may contain functionally diverse members that give rise to clusters in a sequence similarity network (SSN). In prokaryotes, the genome neighborhood of a gene product is indicative of its function and thus, a genome neighborhood network (GNN) deduced for an SSN provides strong clues to the specific function of enzymes constituting the different clusters. The Enzyme Function Initiative (http://enzymefunction.org/) offers services that compute SSNs and GNNs. Results: We have implemented AGeNNT, which utilizes these services, albeit with datasets purged with respect to unspecific protein functions and overrepresented species. AGeNNT generates refined GNNs (rGNNs) that consist of cluster-nodes representing the sequences under study and Pfam-nodes representing enzyme functions encoded in the respective neighborhoods. For cluster-nodes, AGeNNT summarizes the phylogenetic relationships of the contributing species and a statistic indicates how unique nodes and GNs are within this rGNN. Pfam-nodes are annotated with additional features like GO terms describing protein function. For edges, the coverage is given, which is the relative number of neighborhoods containing the considered enzyme function (Pfam-node). AGeNNT is available at https://github.com/kandlinf/agennt. Conclusions: An rGNN is easier to interpret than a conventional GNN, which commonly contains proteins without enzymatic function and overly specific neighborhoods due to phylogenetic bias. The implemented filter routines and the statistic allow the user to identify those neighborhoods that are most indicative of a specific metabolic capacity. Thus, AGeNNT makes it easier to distinguish and annotate functionally different members of enzyme families.
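
The edge coverage defined in the abstract above (the relative number of neighborhoods containing a given Pfam-node) can be computed as in the following Python sketch. The data layout, the function name, and the placeholder family labels (`PF_A`, `PF_B`, `PF_C`) are assumptions for illustration and are not taken from the AGeNNT code base.

```python
from collections import Counter


def edge_coverage(neighborhoods):
    """Fraction of a cluster's genome neighborhoods that contain each Pfam family.

    `neighborhoods` is a list with one collection of Pfam labels per neighborhood.
    """
    if not neighborhoods:
        return {}
    counts = Counter()
    for pfams in neighborhoods:
        counts.update(set(pfams))  # count each family at most once per neighborhood
    total = len(neighborhoods)
    return {pfam: n / total for pfam, n in counts.items()}


if __name__ == "__main__":
    # Hypothetical cluster with four genome neighborhoods.
    toy_cluster = [
        {"PF_A", "PF_B"},
        {"PF_A"},
        {"PF_A", "PF_C"},
        {"PF_B"},
    ]
    for pfam, coverage in sorted(edge_coverage(toy_cluster).items()):
        print(pfam, coverage)
```

In this toy cluster, a family present in three of the four neighborhoods gets a coverage of 0.75.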

    On combining collaborative and automated curation for enzyme function prediction

    Grant number BB/F529038/1. Data generation has vastly exceeded manual annotation in several areas of astronomy, biology, economy, geology, medicine and physics. At the same time, a public community of experts and hobbyists has developed around some of these disciplines thanks to open, editable web resources such as wikis and public annotation challenges. In this thesis I investigate under which conditions a combination of collaborative and automated curation could complete annotation tasks unattainable by human curators alone. My exemplar curation process is taken from the molecular biology domain: the association of all existing enzymes (proteins catalysing a chemical reaction) with their function. Assigning enzymatic function to the proteins in a genome is the first essential problem of metabolic reconstruction, important for biology, medicine, industrial production and environmental studies. In the protein database UniProt, only 3% of the records are currently manually curated and only 60% of the 17 million recorded proteins have some functional annotation, including enzymatic annotation. The proteins in UniProt represent only about 380,000 animal species (2,000 of which have completely sequenced genomes) out of the estimated millions of species existing on earth. The enzyme annotation task already applies to millions of entries and this number is bound to increase rapidly as sequencing efforts intensify. To guide my analysis I first develop a basic model of collaborative curation and evaluate it against molecular biology knowledge bases. The analysis highlights a surprising similarity between open and closed annotation environments on metrics usually connected with “democracy” of content. I then develop and evaluate a method to enhance enzyme function annotation using machine learning which demonstrates very high accuracy, recall and precision and the capacity to scale to millions of enzyme instances. This method needs only a protein sequence as input and is thus widely applicable to genomic and metagenomic analysis. The last phase of the work uses active and guided learning to bring together collaborative and automatic curation. In active learning a machine learning algorithm suggests to the human curators which entry should be annotated next. This strategy has the potential to coordinate and reduce the amount of manual curation while improving classification performance and reducing the number of training instances needed. This work demonstrates the benefits of combining classic machine learning and guided learning to improve the quantity and quality of enzymatic knowledge and to bring us closer to the goal of annotating all existing enzymes.
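
The active learning step described in the abstract above, in which the algorithm suggests which entry a curator should annotate next, can be sketched in Python as follows. The abstract does not say which query strategy the thesis uses. Uncertainty sampling (picking the entry whose predicted probability is closest to 0.5) is assumed here, and the function name and protein identifiers are hypothetical.

```python
def next_to_annotate(predicted_probs, already_labeled=()):
    """Return the unlabeled entry whose predicted probability is closest to 0.5.

    `predicted_probs` maps entry IDs to a classifier's probability that the
    entry belongs to the positive class (e.g. "is an enzyme").
    Returns None when every entry has already been labeled.
    """
    labeled = set(already_labeled)
    candidates = {k: p for k, p in predicted_probs.items() if k not in labeled}
    if not candidates:
        return None
    # Ties are broken by entry ID so the choice is deterministic.
    return min(candidates, key=lambda k: (abs(candidates[k] - 0.5), k))


if __name__ == "__main__":
    # Hypothetical classifier outputs for four unannotated proteins.
    toy_probs = {"P1": 0.95, "P2": 0.55, "P3": 0.10, "P4": 0.48}
    print(next_to_annotate(toy_probs))
    print(next_to_annotate(toy_probs, already_labeled=["P4"]))
```

Under this assumption, curator effort goes first to the entries the classifier is least sure about, which is one way to reduce the number of manual annotations needed.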