2,022 research outputs found

    Predicting genome-wide redundancy using machine learning

    BACKGROUND: Gene duplication can lead to genetic redundancy, which masks the function of mutated genes in genetic analyses. Methods to increase sensitivity in identifying genetic redundancy can improve the efficiency of reverse genetics and lend insights into the evolutionary outcomes of gene duplication. Machine learning techniques are well suited to classifying gene family members into redundant and non-redundant gene pairs in model species where sufficient genetic and genomic data is available, such as Arabidopsis thaliana, the test case used here. RESULTS: Machine learning techniques that combine multiple attributes led to a dramatic improvement in predicting genetic redundancy over single trait classifiers alone, such as BLAST E-values or expression correlation. In withholding analysis, one of the methods used here, Support Vector Machines, was two-fold more precise than single attribute classifiers, reaching a level where the majority of redundant calls were correctly labeled. Using this higher confidence in identifying redundancy, machine learning predicts that about half of all genes in Arabidopsis showed the signature of predicted redundancy with at least one but typically less than three other family members. Interestingly, a large proportion of predicted redundant gene pairs were relatively old duplications (e.g., Ks > 1), suggesting that redundancy is stable over long evolutionary periods. CONCLUSIONS: Machine learning predicts that most genes will have a functionally redundant paralog but will exhibit redundancy with relatively few genes within a family. The predictions and gene pair attributes for Arabidopsis provide a new resource for research in genetics and genome evolution. These techniques can now be applied to other organisms.

    Artificial Intelligence Tools to Better Understand Seed Dormancy and Germination

    Despite the large number of publications available, the control mechanisms of seed dormancy and germination are far from fully understood. Seed dormancy and germination are very complex biological processes because they involve multiple factors (physiological, mechanical, and environmental) and their nonlinear interactions. This explains why very small variations in some of those factors, and in the way they interact, cause enormous variation in the results obtained. Multifactorial processes like these can be modeled using computer-based tools to better predict results. In this chapter, some basic concepts of seed dormancy and germination and the main physiological factors involved in seed dormancy (particularly dormancy inducers and dormancy breakers) and in seed germination are reviewed. In the next two sections, we describe the use of artificial intelligence computer-based models to better understand the physiological mechanisms of seed dormancy (how dormancy is controlled and how it can be released) and seed germination. Finally, some applications of artificial neural networks, fuzzy logic, and genetic algorithms to elucidate critical factors and predict optimal conditions for seed dormancy-breaking and germination are given as examples of the utility of these powerful computer-based tools.

    Sparse Probit Linear Mixed Model

    Linear Mixed Models (LMMs) are important tools in statistical genetics. When used for feature selection, they make it possible to find a sparse set of genetic traits that best predict a continuous phenotype of interest, while simultaneously correcting for various confounding factors such as age, ethnicity and population structure. Formulated as models for linear regression, LMMs have been restricted to continuous phenotypes. We introduce the Sparse Probit Linear Mixed Model (Probit-LMM), where we generalize the LMM modeling paradigm to binary phenotypes. As a technical challenge, the model no longer possesses a closed-form likelihood function. In this paper, we present a scalable approximate inference algorithm that lets us fit the model to high-dimensional data sets. We show on three real-world examples from different domains that in the setup of binary labels, our algorithm leads to better prediction accuracies and also selects features which show less correlation with the confounding factors. Comment: Published version, 21 pages, 6 figures
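
To make the "no closed-form likelihood" remark concrete, here is one common way to write a probit model with correlated noise. It is only a sketch: the symbols $w$, $\Sigma$, $K$, $\lambda_0$, $\lambda_1$, $\lambda_2$ and the exact covariance structure are assumptions for illustration and are not taken from the abstract.

```latex
% Assumed notation: X is the n x d feature matrix, w the sparse weight vector,
% K a similarity (kernel) matrix capturing confounders, y_i in {-1, +1}.
\begin{align}
  y_i &= \operatorname{sign}\!\left(x_i^\top w + \varepsilon_i\right),
  \qquad \varepsilon \sim \mathcal{N}(0, \Sigma),
  \qquad \Sigma = \lambda_1 I + \lambda_2 K, \\
  P(y \mid X, w) &= \int_{\{z \,:\, y_i z_i > 0 \ \forall i\}}
      \mathcal{N}(z;\, Xw,\, \Sigma)\, \mathrm{d}z, \\
  \hat{w} &= \arg\min_{w}\; -\log P(y \mid X, w) + \lambda_0 \lVert w \rVert_1 .
\end{align}
```

Under these assumptions, the likelihood is a Gaussian integral over an orthant. When $\Sigma$ is not diagonal it has no closed form, which is why an approximate inference scheme is needed. The $\ell_1$ penalty is one standard way to obtain a sparse $w$.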

    RootNav 2.0: Deep learning for automatic navigation of complex plant root architectures

    © The Author(s) 2019. Published by Oxford University Press. BACKGROUND: In recent years quantitative analysis of root growth has become increasingly important as a way to explore the influence of abiotic stress such as high temperature and drought on a plant's ability to take up water and nutrients. Segmentation and feature extraction of plant roots from images presents a significant computer vision challenge. Root images contain complicated structures, variations in size, background, occlusion, clutter and variation in lighting conditions. We present a new image analysis approach that provides fully automatic extraction of complex root system architectures from a range of plant species in varied imaging set-ups. Driven by modern deep-learning approaches, RootNav 2.0 replaces previously manual and semi-automatic feature extraction with an extremely deep multi-task convolutional neural network architecture. The network also locates seeds, first order and second order root tips to drive a search algorithm seeking optimal paths throughout the image, extracting accurate architectures without user interaction. RESULTS: We develop and train a novel deep network architecture to explicitly combine local pixel information with global scene information in order to accurately segment small root features across high-resolution images. The proposed method was evaluated on images of wheat (Triticum aestivum L.) from a seedling assay. Compared with semi-automatic analysis via the original RootNav tool, the proposed method demonstrated comparable accuracy, with a 10-fold increase in speed. The network was able to adapt to different plant species via transfer learning, offering similar accuracy when transferred to an Arabidopsis thaliana plate assay. A final instance of transfer learning, to images of Brassica napus from a hydroponic assay, still demonstrated good accuracy despite many fewer training images. 
CONCLUSIONS: We present RootNav 2.0, a new approach to root image analysis driven by a deep neural network. The tool can be adapted to new image domains with a reduced number of images, and offers substantial speed improvements over semi-automatic and manual approaches. The tool outputs root architectures in the widely accepted RSML standard, for which numerous analysis packages exist (http://rootsystemml.github.io/), as well as segmentation masks compatible with other automated measurement tools. The tool will provide researchers with the ability to analyse root systems at larger scales than ever before, at a time when large scale genomic studies have made this more important than ever.

    Combining classifiers to predict gene function in Arabidopsis thaliana using large-scale gene expression measurements

    BACKGROUND: Arabidopsis thaliana is the model species of current plant genomic research with a genome size of 125 Mb and approximately 28,000 genes. The function of half of these genes is currently unknown. The purpose of this study is to infer gene function in Arabidopsis using machine-learning algorithms applied to large-scale gene expression data sets, with the goal of identifying genes that are potentially involved in plant response to abiotic stress. RESULTS: Using in-house and publicly available data, we assembled a large set of gene expression measurements for A. thaliana. Using those genes of known function, we first evaluated and compared the ability of basic machine-learning algorithms to predict which genes respond to stress. Predictive accuracy was measured using ROC50 and precision curves derived through cross validation. To improve accuracy, we developed a method for combining these classifiers using a weighted-voting scheme. The combined classifier was then trained on genes of known function and applied to genes of unknown function, identifying genes that potentially respond to stress. Visual evidence corroborating the predictions was obtained using electronic Northern analysis. Three of the predicted genes were chosen for biological validation. Gene knockout experiments confirmed that all three are involved in a variety of stress responses. The biological analysis of one of these genes (At1g16850) is presented here, where it is shown to be necessary for the normal response to temperature and NaCl. CONCLUSION: Supervised learning methods applied to large-scale gene expression measurements can be used to predict gene function. However, the ability of basic learning methods to predict stress response varies widely and depends heavily on how much dimensionality reduction is used.
Our method of combining classifiers can improve the accuracy of such predictions – in this case, predictions of genes involved in stress response in plants – and it effectively chooses the appropriate amount of dimensionality reduction automatically. The method provides a useful means of identifying genes in A. thaliana that potentially respond to stress, and we expect it would be useful in other organisms and for other gene functions.
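
The abstract does not spell out its weighted-voting scheme, so the following is only a minimal sketch of the general idea. The function name, the classifier names, the labels, and the weights are all hypothetical; in practice a weight might come from each classifier's cross-validated performance.

```python
from collections import defaultdict


def weighted_vote(predictions, weights):
    """Combine per-classifier labels for one gene by weighted voting.

    predictions: dict mapping classifier name -> predicted label
    weights:     dict mapping classifier name -> non-negative weight
                 (hypothetical values, e.g. cross-validated accuracy)
    """
    totals = defaultdict(float)
    for name, label in predictions.items():
        # Each classifier adds its weight to the label it voted for.
        totals[label] += weights[name]
    # Pick the label with the largest total weight; ties go to the
    # label that sorts first, so the result is deterministic.
    return max(sorted(totals), key=lambda label: totals[label])


# Illustrative inputs only; these are not values from the study.
predictions = {"svm": "stress", "knn": "non-stress", "naive_bayes": "stress"}
weights = {"svm": 0.5, "knn": 0.9, "naive_bayes": 0.3}
combined = weighted_vote(predictions, weights)
```

Here two classifiers vote "stress", but their combined weight is lower than that of the single, more heavily weighted classifier, so the combined call follows the latter. This is how weighted voting differs from a simple majority vote.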

    Non-homology-based prediction of gene functions in maize (Zea mays ssp. mays)

    Advances in genome sequencing and annotation have eased the difficulty of identifying new gene sequences. Predicting the functions of these newly identified genes remains challenging. Genes descended from a common ancestral sequence are likely to have common functions. As a result, homology is widely used for gene function prediction. This means functional annotation errors also propagate from one species to another. Several approaches based on machine learning classification algorithms were evaluated for their ability to accurately predict gene function from non-homology gene features. Among the eight supervised classification algorithms evaluated, random forest-based prediction consistently provided the most accurate gene function prediction. Non-homology-based functional annotation provides complementary strengths to homology-based annotation, with higher average performance in Biological Process GO terms, the domain where homology-based functional annotation performs the worst, and weaker performance in Molecular Function GO terms, the domain where the accuracy of homology-based functional annotation is highest. GO prediction models trained with homology-based annotations were able to successfully predict annotations from a manually curated “gold standard” GO annotation set. Non-homology-based functional annotation based on machine learning may ultimately prove useful both as a method to assign predicted functions to orphan genes which lack functionally characterized homologs, and to identify and correct functional annotation errors which were propagated through homology-based functional annotations.

    Gene regulatory network inference: connecting plant biology and mathematical modeling

    Plant responses to environmental and intrinsic signals are tightly controlled by multiple transcription factors (TFs). These TFs and their regulatory connections form gene regulatory networks (GRNs), which provide a blueprint of the transcriptional regulation underlying plant development and environmental responses. This review provides examples of experimental methodologies commonly used to identify regulatory interactions and generate GRNs. Additionally, this review describes network inference techniques that leverage gene expression data to predict regulatory interactions. These computational and experimental methodologies yield complex networks that can identify new regulatory interactions, driving novel hypotheses. Biological properties that contribute to the complexity of GRNs are also described in this review. These include network topology, network size, transient binding of TFs to DNA, and competition between multiple upstream regulators. Finally, this review highlights the potential of machine learning approaches to leverage gene expression data to predict phenotypic outputs.

    Simple statistical models predict C-to-U edited sites in plant mitochondrial RNA

    BACKGROUND: RNA editing is the process whereby an RNA sequence is modified from the sequence of the corresponding DNA template. In the mitochondria of land plants, some cytidines are converted to uridines before translation. Despite substantial study, the molecular biological mechanism by which C-to-U RNA editing proceeds remains relatively obscure, although several experimental studies have implicated a role for cis-recognition. A highly non-random distribution of nucleotides is observed in the immediate vicinity of edited sites (within 20 nucleotides 5' and 3'), but no precise consensus motif has been identified. RESULTS: Data for analysis were derived from the complete mitochondrial genomes of Arabidopsis thaliana, Brassica napus, and Oryza sativa; additionally, a combined data set of observations across all three genomes was generated. We selected datasets based on the 20 nucleotides 5' and the 20 nucleotides 3' of edited sites and an equivalently sized and appropriately constructed null-set of non-edited sites. We used tree-based statistical methods and random forests to generate models of C-to-U RNA editing based on the nucleotides surrounding the edited/non-edited sites and on the estimated folding energies of those regions. Tree-based statistical methods based on primary sequence data surrounding edited/non-edited sites and estimates of free energy of folding yield models with optimistic re-substitution-based estimates of ~0.71 accuracy, ~0.64 sensitivity, and ~0.88 specificity. Random forest analysis yielded better models and more exact performance estimates with ~0.74 accuracy, ~0.72 sensitivity, and ~0.81 specificity for the combined observations. CONCLUSIONS: Simple models do moderately well in predicting which cytidines will be edited to uridines, and provide the first quantitative predictive models for RNA edited sites in plant mitochondria.
Our analysis shows that the identity of the nucleotide -1 to the edited C and the estimated free energy of folding for a 41 nt region surrounding the edited C are the most important variables that distinguish most edited from non-edited sites. However, the results suggest that primary sequence data and simple free energy of folding calculations alone are insufficient to make highly accurate predictions.
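
For readers less familiar with the three metrics quoted above, the sketch below shows how accuracy, sensitivity, and specificity follow from confusion-matrix counts, treating edited sites as the positive class. The function name and the counts are hypothetical and chosen only for illustration; they are not data from the study.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return accuracy, sensitivity, and specificity from confusion-matrix counts.

    tp / fn: edited sites predicted as edited / as non-edited
    tn / fp: non-edited sites predicted as non-edited / as edited
    """
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,      # fraction of all sites called correctly
        "sensitivity": tp / (tp + fn),      # fraction of edited sites recovered
        "specificity": tn / (tn + fp),      # fraction of non-edited sites recovered
    }


# Hypothetical counts, not taken from the paper.
metrics = classification_metrics(tp=30, fp=10, tn=50, fn=10)
```

Because accuracy pools both classes, a model can score well on it while still missing many edited sites. That is one reason to report sensitivity and specificity alongside accuracy.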