4,038 research outputs found

    Statistical analysis of genomic protein family and domain controlled annotations for functional investigation of classified gene lists

    Background: The growing body of protein family and domain based annotations constitutes important information for understanding protein functions and gaining insight into relations among their codifying genes. To enable analysis of gene proteomic annotations, we implemented novel modules within GFINDer, a Web system we previously developed that dynamically aggregates functional and phenotypic annotations of user-uploaded gene lists and supports their statistical analysis and mining. Results: Exploiting protein information in the Pfam and InterPro databanks, we developed and added to GFINDer original modules specifically devoted to the exploration and analysis of functional signatures of gene protein products. They allow annotating numerous user-classified nucleotide sequence identifiers with controlled information on related protein families, domains and functional sites, classifying them according to such protein annotation categories, and statistically analyzing the obtained classifications. In particular, when uploaded nucleotide sequence identifiers are subdivided into classes, the Statistics Protein Families&Domains module allows estimating the relevance of Pfam or InterPro controlled annotations for the uploaded genes by highlighting protein signatures significantly more represented within user-defined classes of genes. In addition, the Logistic Regression module allows identifying protein functional signatures that better explain the considered gene classification. Conclusion: The novel GFINDer modules provide genomic protein family and domain analyses supporting better functional interpretation of gene classes, for instance those defined through statistical and clustering analyses of gene expression results from microarray experiments. They can hence help in understanding fundamental biological processes and complex cellular mechanisms influenced by protein domain composition, and contribute to unveiling new biomedical knowledge about the codifying genes.

    Complete reannotation of the Arabidopsis genome: methods, tools, protocols and the final release

    BACKGROUND: Since the initial publication of its complete genome sequence, Arabidopsis thaliana has become more important than ever as a model for plant research. However, the initial genome annotation was submitted by multiple centers using inconsistent methods, making the data difficult to use for many applications. RESULTS: Over the course of three years, TIGR has completed its effort to standardize the structural and functional annotation of the Arabidopsis genome. Using both manual and automated methods, Arabidopsis gene structures were refined and gene products were renamed and assigned to Gene Ontology categories. We present an overview of the methods employed, tools developed, and protocols followed, summarizing the contents of each data release with special emphasis on our final annotation release (version 5). CONCLUSION: Over the entire period, several thousand new genes and pseudogenes were added to the annotation. Approximately one third of the originally annotated gene models were significantly refined yielding improved gene structure annotations, and every protein-coding gene was manually inspected and classified using Gene Ontology terms

    Bioinformatics tools for the genetic dissection of complex traits in chickens

    This thesis explores the genetic characterization of the mechanisms underlying complex traits in chicken through the use and development of bioinformatics tools. The characterization of quantitative trait loci (QTLs) controlling complex traits has proven to be very challenging. This thesis comprises the study of experimental designs, annotation procedures and functional analyses. These represent some of the main 'bottlenecks' involved in the integration of QTLs with the biological interpretation of high-throughput technologies. The thesis begins with an investigation of the bioinformatics tools and procedures available for genome research, briefly reviewing microarray technology and commonly applied experimental designs. A targeted experimental design based on the concept of genetical genomics is then presented and applied in order to study a known functional QTL responsible for chicken body weight. This approach contrasts the gene expression levels of two alternative QTL genotypes, hence narrowing the QTL-phenotype gap and giving a direct quantification of the link between the genotypes and the genetic responses. Potential candidate genes responsible for the chicken body weight QTL are identified by using the location of the genes, their expression and their biological significance. In order to deal with the multiple sources of information and exploit the data effectively, a systematic approach and a relational database were developed to improve the annotation of the probes of the ARK-Genomics G. gallus 13K v4.0 cDNA array used in the experiment. To follow up the investigation of the targeted genetical genomics study, a detailed functional analysis is performed on the dataset. The aim is first to identify the downstream effects through the identification of functional variation found in pathways, and secondly to achieve a further characterization of potential candidate genes by using comparative genomics and sequence analyses. Finally, an investigation of the body weight QTL syntenic regions and their reported QTLs is presented.

    A computational approach to candidate gene prioritization for X-linked mental retardation using annotation-based binary filtering and motif-based linear discriminatory analysis

    Background: Several computational candidate gene selection and prioritization methods have recently been developed. These in silico selection and prioritization techniques are usually based on two central approaches: the examination of similarities to known disease genes and/or the evaluation of functional annotation of genes. Each of these approaches has its own caveats. Here we employ a previously described method of candidate gene prioritization based mainly on gene annotation, together with a technique based on the evaluation of pertinent sequence motifs or signatures, in an attempt to refine the gene prioritization approach. We apply this approach to X-linked mental retardation (XLMR), a group of heterogeneous disorders for which some of the underlying genetics is known. Results: The gene annotation-based binary filtering method yielded a ranked list of putative XLMR candidate genes with good plausibility of being associated with the development of mental retardation. In parallel, a motif finding approach based on linear discriminatory analysis (LDA) was employed to identify short sequence patterns that may discriminate XLMR from non-XLMR genes. High rates (>80%) of correct classification were achieved, suggesting that the identification of these motifs effectively captures genomic signals associated with XLMR vs. non-XLMR genes. The computational tools developed for the motif-based LDA are integrated into the freely available genomic analysis portal Galaxy (http://main.g2.bx.psu.edu/). Nine genes (APLN, ZC4H2, MAGED4, MAGED4B, RAP2C, FAM156A, FAM156B, TBL1X, and UXT) were highlighted as highly ranked XLMR candidates. Conclusions: The combination of gene annotation information and sequence motif-orientated computational candidate gene prediction methods highlights an added benefit in generating a list of plausible candidate genes, as has been demonstrated for XLMR. Reviewers: This article was reviewed by Dr Barbara Bardoni (nominated by Prof Juergen Brosius); Prof Neil Smalheiser and Dr Dustin Holloway (nominated by Prof Charles DeLisi).

    Doctor of Philosophy

    Dissertation. Successful molecular diagnosis using an exome sequence hinges on accurate association of damaging variants to the patient's phenotype. Unfortunately, many clinical scenarios (e.g., single affected or small nuclear families) have little power to confidently identify damaging alleles using sequence data alone. Today's diagnostic tools are simply underpowered for accurate diagnosis in these situations, limiting successful diagnoses. In response, clinical genetics relies on candidate-gene and variant lists to limit the search space. Despite their practical utility, these lists suffer from inherent and significant limitations. The impact of false negatives on diagnostic accuracy is considerable because candidate-gene and variant lists are assembled ad hoc, choosing alleles based upon expert knowledge. Alleles not in the list are not considered, ending hope for novel discoveries. Rational alternatives to ad hoc assemblages of candidate lists are thus badly needed. In response, I created Phevor, the Phenotype Driven Variant Ontological Re-ranking tool. Phevor works by combining knowledge resident in biomedical ontologies, like the human phenotype and gene ontologies, with the outputs of variant-interpretation tools such as SIFT, GERP+, Annovar and VAAST. Phevor can then accurately prioritize candidates identified by third-party variant-interpretation tools in light of knowledge found in the ontologies, effectively bypassing the need for candidate-gene and variant lists. Phevor differs from tools such as Phenomizer and Exomiser, as it does not postulate a set of fixed associations between genes and phenotypes. Rather, Phevor dynamically integrates knowledge resident in multiple bio-ontologies into the prioritization process. This enables Phevor to improve diagnostic accuracy for established diseases and previously undescribed or atypical phenotypes. Phevor was benchmarked by inserting known disease alleles into otherwise healthy exomes. Using the phenotype of the known disease and the variant interpretation tool VAAST (Variant Annotation, Analysis and Search Tool), Phevor can rank 100% of the known alleles in the top 10 and 80% as the top candidate. Phevor is currently part of the pipeline used to diagnose cases as part of the Utah Genome Project. Successful diagnoses of several phenotypes have proven Phevor to be a reliable diagnostic tool that can improve the analysis of any disease-gene search.

    Text Mining and Gene Expression Analysis Towards Combined Interpretation of High Throughput Data

    Microarrays can capture gene expression activity for thousands of genes simultaneously and thus make it possible to analyze cell physiology and disease processes on molecular level. The interpretation of microarray gene expression experiments profits from knowledge on the analyzed genes and proteins and the biochemical networks in which they play a role. The trend is towards the development of data analysis methods that integrate diverse data types. Currently, the most comprehensive biomedical knowledge source is a large repository of free text articles. Text mining makes it possible to automatically extract and use information from texts. This thesis addresses two key aspects, biomedical text mining and gene expression data analysis, with the focus on providing high-quality methods and data that contribute to the development of integrated analysis approaches. The work is structured in three parts. Each part begins by providing the relevant background, and each chapter describes the developed methods as well as applications and results. Part I deals with biomedical text mining: Chapter 2 summarizes the relevant background of text mining; it describes text mining fundamentals, important text mining tasks, applications and particularities of text mining in the biomedical domain, and evaluation issues. In Chapter 3, a method for generating high-quality gene and protein name dictionaries is described. The analysis of the generated dictionaries revealed important properties of individual nomenclatures and the used databases (Fundel and Zimmer, 2006). The dictionaries are publicly available via a Wiki, a web service, and several client applications (Szugat et al., 2005). In Chapter 4, methods for the dictionary-based recognition of gene and protein names in texts and their mapping onto unique database identifiers are described. These methods make it possible to extract information from texts and to integrate text-derived information with data from other sources. 
Three named entity identification systems have been set up, two of them building upon the previously existing tool ProMiner (Hanisch et al., 2003). All of them have shown very good performance in the BioCreAtIvE challenges (Fundel et al., 2005a; Hanisch et al., 2005; Fundel and Zimmer, 2007). In Chapter 5, a new method for relation extraction (Fundel et al., 2007) is presented. It was applied on the largest collection of biomedical literature abstracts, and thus a comprehensive network of human gene and protein relations has been generated. A classification approach (Küffner et al., 2006) can be used to specify relation types further; e.g., as activating, direct physical, or gene regulatory relation. Part II deals with gene expression data analysis: Gene expression data needs to be processed so that differentially expressed genes can be identified. Gene expression data processing consists of several sequential steps. Two important steps are normalization, which aims at removing systematic variances between measurements, and quantification of differential expression by p-value and fold change determination. Numerous methods exist for these tasks. Chapter 6 describes the relevant background of gene expression data analysis; it presents the biological and technical principles of microarrays and gives an overview of the most relevant data processing steps. Finally, it provides a short introduction to osteoarthritis, which is in the focus of the analyzed gene expression data sets. In Chapter 7, quality criteria for the selection of normalization methods are described, and a method for the identification of differentially expressed genes is proposed, which is appropriate for data with large intensity variances between spots representing the same gene (Fundel et al., 2005b).
Furthermore, a system is described that selects an appropriate combination of feature selection method and classifier, and thus identifies genes which lead to good classification results and show consistent behavior in different sample subgroups (Davis et al., 2006). The analysis of several gene expression data sets dealing with osteoarthritis is described in Chapter 8. This chapter contains the biomedical analysis of relevant disease processes and distinct disease stages (Aigner et al., 2006a), and a comparison of various microarray platforms and osteoarthritis models. Part III deals with integrated approaches and thus provides the connection between parts I and II: Chapter 9 gives an overview of different types of integrated data analysis approaches, with a focus on approaches that integrate gene expression data with manually compiled data, large-scale networks, or text mining. In Chapter 10, a method for the identification of genes which are consistently regulated and have a coherent literature background (Küffner et al., 2005) is described. This method indicates how gene and protein name identification and gene expression data can be integrated to return clusters which contain genes that are relevant for the respective experiment together with literature information that supports interpretation. Finally, in Chapter 11 ideas on how the described methods can contribute to current research and possible future directions are presented.

    Hyperosmotic priming of arabidopsis seedlings establishes a long-term somatic memory accompanied by specific changes of the epigenome

    Background: In arid and semi-arid environments, drought and soil salinity usually occur at the beginning and end of a plant's life cycle, offering a natural opportunity for the priming of young plants to enhance stress tolerance in mature plants. Chromatin marks, such as histone modifications, provide a potential molecular mechanism for priming plants to environmental stresses, but whether transient exposure of seedlings to hyperosmotic stress leads to chromatin changes that are maintained throughout vegetative growth remains unclear. Results: We have established an effective protocol for hyperosmotic priming in the model plant Arabidopsis, which includes a transient mild salt treatment of seedlings followed by an extensive period of growth in control conditions. Primed plants are identical to non-primed plants in growth and development, yet they display reduced salt uptake and enhanced drought tolerance after a second stress exposure. ChIP-seq analysis of four histone modifications revealed that the priming treatment altered the epigenomic landscape; the changes were small but they were specific for the treated tissue, varied in number and direction depending on the modification, and preferentially targeted transcription factors. Notably, priming leads to shortening and fractionation of H3K27me3 islands. This effect fades over time, but is still apparent after a ten-day growth period in control conditions. Several genes with priming-induced differences in H3K27me3 showed altered transcriptional responsiveness to the second stress treatment. Conclusion: Experience of transient hyperosmotic stress by young plants is stored in a long-term somatic memory comprising differences of chromatin status, transcriptional responsiveness and whole plant physiology.