375 research outputs found

    C-terminal motif prediction in eukaryotic proteomes using comparative genomics and statistical over-representation across protein families

    Background: The carboxy termini of proteins are a frequent site of activity for a variety of biologically important functions, ranging from post-translational modification to protein targeting. Several short peptide motifs involved in protein sorting roles and dependent upon their proximity to the C-terminus for proper function have already been characterized. As a limited number of such motifs have been identified, the potential exists for genome-wide statistical analysis and comparative genomics to reveal novel peptide signatures functioning in a C-terminal dependent manner. We have applied a novel methodology to the prediction of C-terminal-anchored peptide motifs involving a simple z-statistic and several techniques for improving the signal-to-noise ratio.
    Results: We examined the statistical over-representation of position-specific C-terminal tripeptides in 7 eukaryotic proteomes. Sequence randomization models and simple-sequence masking were applied to the successful reduction of background noise. Similarly, as C-terminal homology among members of large protein families may artificially inflate tripeptide counts in an irrelevant and obfuscating manner, gene-family clustering was performed prior to the analysis in order to assess tripeptide over-representation across protein families as opposed to across all proteins. Finally, comparative genomics was used to identify tripeptides significantly occurring in multiple species. This approach has been able to predict, to our knowledge, all C-terminally anchored targeting motifs present in the literature. These include the PTS1 peroxisomal targeting signal (SKL*), the ER-retention signal (K/HDEL*), the ER-retrieval signal for membrane bound proteins (KKxx*), the prenylation signal (CC*) and the CaaX box prenylation motif. In addition to a high statistical over-representation of these known motifs, a collection of significant tripeptides with a high propensity for biological function exists between species, among kingdoms and across eukaryotes. Motifs of note include a serine-acidic peptide (DSD*) as well as several lysine enriched motifs found in nearly all eukaryotic genomes examined.
    Conclusion: We have successfully generated a high confidence representation of eukaryotic motifs anchored at the C-terminus. A high incidence of true-positives in our results suggests that several previously unidentified tripeptide patterns are strong candidates for representing novel peptide motifs of a widely employed nature in the C-terminal biology of eukaryotes. Our application of comparative genomics, statistical over-representation and the adjustment for protein family homology has generated several hypotheses concerning the C-terminal topology as it pertains to sorting and potential protein interaction signals. This approach to background reduction could be expanded for application to protein motif prediction in the protein interior. A parallel N-terminal analysis is presented as supplementary data.
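The abstract above mentions a "simple z-statistic" for tripeptide over-representation but does not give its formula. The following is a minimal sketch of one plausible form, assuming a binomial null model in which the expected frequency of a C-terminal tripeptide is the product of its residues' background frequencies. The function name `cterm_tripeptide_z` and the null model are illustrative assumptions, not the authors' method, and the sketch omits the randomization, masking and gene-family clustering steps the abstract describes.

```python
from collections import Counter
from math import sqrt


def cterm_tripeptide_z(sequences):
    """Return {tripeptide: z-score} for the last three residues of each sequence.

    Assumed null model (not taken from the paper): residues are independent,
    so a tripeptide's probability is the product of its residue frequencies,
    and its count over n proteins is Binomial(n, p).
    """
    seqs = [s for s in sequences if len(s) >= 3]
    n = len(seqs)

    # Background residue frequencies estimated from all residues in the input.
    residue_counts = Counter("".join(seqs))
    total = sum(residue_counts.values())
    freq = {aa: count / total for aa, count in residue_counts.items()}

    # Observed counts of each C-terminal tripeptide.
    observed = Counter(s[-3:] for s in seqs)

    z_scores = {}
    for tri, obs in observed.items():
        p = freq[tri[0]] * freq[tri[1]] * freq[tri[2]]
        sd = sqrt(n * p * (1 - p))
        if sd == 0:
            continue  # degenerate case: every residue in the input is identical
        z_scores[tri] = (obs - n * p) / sd
    return z_scores


if __name__ == "__main__":
    toy = ["MAAASKL", "MGGGSKL", "MCCCSKL", "MDDDAAA"]
    for tri, z in sorted(cterm_tripeptide_z(toy).items(), key=lambda t: -t[1]):
        print(tri, round(z, 2))
```

In this toy input, SKL ends three of the four sequences and therefore scores well above AAA, which ends only one. Real proteomes would need the family-level counting described in the abstract to avoid inflating scores through homology.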

    Combining classifiers to predict gene function in Arabidopsis thaliana using large-scale gene expression measurements

    Background: Arabidopsis thaliana is the model species of current plant genomic research with a genome size of 125 Mb and approximately 28,000 genes. The function of half of these genes is currently unknown. The purpose of this study is to infer gene function in Arabidopsis using machine-learning algorithms applied to large-scale gene expression data sets, with the goal of identifying genes that are potentially involved in plant response to abiotic stress.
    Results: Using in-house and publicly available data, we assembled a large set of gene expression measurements for A. thaliana. Using those genes of known function, we first evaluated and compared the ability of basic machine-learning algorithms to predict which genes respond to stress. Predictive accuracy was measured using ROC50 and precision curves derived through cross validation. To improve accuracy, we developed a method for combining these classifiers using a weighted-voting scheme. The combined classifier was then trained on genes of known function and applied to genes of unknown function, identifying genes that potentially respond to stress. Visual evidence corroborating the predictions was obtained using electronic Northern analysis. Three of the predicted genes were chosen for biological validation. Gene knockout experiments confirmed that all three are involved in a variety of stress responses. The biological analysis of one of these genes (At1g16850) is presented here, where it is shown to be necessary for the normal response to temperature and NaCl.
    Conclusion: Supervised learning methods applied to large-scale gene expression measurements can be used to predict gene function. However, the ability of basic learning methods to predict stress response varies widely and depends heavily on how much dimensionality reduction is used. Our method of combining classifiers can improve the accuracy of such predictions (in this case, predictions of genes involved in stress response in plants), and it effectively chooses the appropriate amount of dimensionality reduction automatically. The method provides a useful means of identifying genes in A. thaliana that potentially respond to stress, and we expect it would be useful in other organisms and for other gene functions.
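The abstract above describes combining classifiers with a weighted-voting scheme but does not say how the weights are derived. As a rough illustration only, the sketch below combines per-gene scores from several classifiers as a normalized weighted average. The function name `weighted_vote`, the classifier names, and the idea of taking weights from cross-validated performance are all assumptions made for this example.

```python
def weighted_vote(scores, weights):
    """Combine per-classifier scores into one score per gene.

    scores:  dict mapping classifier name -> list of scores in [0, 1],
             one entry per gene, same gene order for every classifier.
    weights: dict mapping classifier name -> non-negative weight
             (hypothetically, a cross-validated accuracy estimate).
    """
    total_weight = sum(weights[name] for name in scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")

    n_genes = len(next(iter(scores.values())))
    combined = [0.0] * n_genes
    for name, values in scores.items():
        if len(values) != n_genes:
            raise ValueError(f"classifier {name!r} has a different number of genes")
        w = weights[name] / total_weight  # normalize so the weights sum to 1
        for i, value in enumerate(values):
            combined[i] += w * value
    return combined


if __name__ == "__main__":
    # Hypothetical classifier names and scores for two genes.
    scores = {"svm": [1.0, 0.0], "knn": [0.0, 1.0]}
    weights = {"svm": 3.0, "knn": 1.0}
    print(weighted_vote(scores, weights))
```

Genes would then be ranked by the combined score. A classifier that performs poorly at a given level of dimensionality reduction simply receives a small weight, which is one way such a scheme could select the amount of reduction implicitly.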

    The role of the Arabidopsis FUSCA3 transcription factor during inhibition of seed germination at high temperature

    Background: Imbibed seeds integrate environmental and endogenous signals to break dormancy and initiate growth under optimal conditions. Seed maturation plays an important role in determining the survival of germinating seeds; for example, one of the roles of dormancy is to stagger germination to prevent mass growth under suboptimal conditions. The B3-domain transcription factor FUSCA3 (FUS3) is a master regulator of seed development and an important node in hormonal interaction networks in Arabidopsis thaliana. Its function has been mainly characterized during embryonic development, where FUS3 is highly expressed to promote seed maturation and dormancy by regulating ABA/GA levels.
    Results: In this study, we present evidence for a role of FUS3 in delaying seed germination at supraoptimal temperatures that would be lethal for the developing seedlings. During seed imbibition at supraoptimal temperature, the FUS3 promoter is reactivated and induces de novo synthesis of FUS3 mRNA, followed by FUS3 protein accumulation. Genetic analysis shows that FUS3 contributes to the delay of seed germination at high temperature. Unlike WT, seeds overexpressing FUS3 (ML1:FUS3-GFP) during imbibition are hypersensitive to high temperature and do not germinate; however, they can fully germinate after recovery at control temperature, reaching 90% seedling survival. ML1:FUS3-GFP hypersensitivity to high temperature can be partly recovered in the presence of fluridone, an inhibitor of ABA biosynthesis, suggesting this hypersensitivity is due in part to a higher ABA level in this line. Transcriptomic analysis shows that WT seeds imbibed at supraoptimal temperature activate seed-specific genes and ABA biosynthetic and signaling genes, while inhibiting genes that promote germination and growth, such as GA biosynthetic and signaling genes.
    Conclusion: In this study, we have uncovered a novel function for the master regulator of seed maturation, FUS3, in delaying germination at supraoptimal temperature. Physiologically, this is important since delaying germination has a protective role at high temperature. Transcriptomic analysis of seeds imbibed at supraoptimal temperature reveals that a complex program is in place, which involves not only the regulation of heat and dehydration response genes to adjust cellular functions, but also the activation of seed-specific programs and the inhibition of germination-promoting programs to delay germination.

    Current status of the multinational Arabidopsis community

    Publisher Copyright: © 2020 The Authors. Plant Direct published by American Society of Plant Biologists and the Society for Experimental Biology and John Wiley & Sons Ltd.
    The multinational Arabidopsis research community is highly collaborative and over the past thirty years these activities have been documented by the Multinational Arabidopsis Steering Committee (MASC). Here, we (a) highlight recent research advances made with the reference plant Arabidopsis thaliana; (b) provide summaries from recent reports submitted by MASC subcommittees, projects and resources associated with MASC and from MASC country representatives; and (c) initiate a call for ideas and foci for the “fourth decadal roadmap,” which will advise and coordinate the global activities of the Arabidopsis research community.

    NLStradamus: a simple Hidden Markov Model for nuclear localization signal prediction

    Background: Nuclear localization signals (NLSs) are stretches of residues within a protein that are important for the regulated nuclear import of the protein. Of the many import pathways that exist in yeast, the best characterized is termed the 'classical' NLS pathway. The classical NLS contains specific patterns of basic residues and computational methods have been designed to predict the location of these motifs on proteins. The consensus sequences, or patterns, for the other import pathways are less well-understood.
    Results: In this paper, we present an analysis of characterized NLSs in yeast, and find, despite the large number of nuclear import pathways, that NLSs seem to show similar patterns of amino acid residues. We test current prediction methods and observe a low true positive rate. We therefore suggest an approach using hidden Markov models (HMMs) to predict novel NLSs in proteins. We show that our method is able to consistently find 37% of the NLSs with a low false positive rate and that our method retains its true positive rate outside of the yeast data set used for the training parameters.
    Conclusion: Our implementation of this model, NLStradamus, is made available at: http://www.moseslab.csb.utoronto.ca/NLStradamus/
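To illustrate the general idea of labeling sequence stretches with a hidden Markov model, here is a toy two-state sketch (background versus NLS-like) decoded with the Viterbi algorithm. It is not the NLStradamus model: the state layout, the reduction of residues to basic (K/R) versus other, and every probability below are invented for this example.

```python
from math import log

BASIC = set("KR")  # simplification: treat only lysine and arginine as basic


def viterbi_nls(seq, p_stay_bg=0.95, p_stay_nls=0.8,
                p_basic_bg=0.1, p_basic_nls=0.6, p_start_nls=0.1):
    """Label each residue 'N' (NLS-like state) or '-' (background state).

    All parameters are illustrative guesses, not trained values.
    """
    if not seq:
        return ""
    states = ("bg", "nls")
    trans = {
        ("bg", "bg"): log(p_stay_bg), ("bg", "nls"): log(1 - p_stay_bg),
        ("nls", "nls"): log(p_stay_nls), ("nls", "bg"): log(1 - p_stay_nls),
    }

    def emit(state, aa):
        # Log-probability of seeing a basic or non-basic residue in this state.
        p = p_basic_nls if state == "nls" else p_basic_bg
        return log(p if aa in BASIC else 1 - p)

    # Initialization with the first residue.
    score = {"bg": log(1 - p_start_nls) + emit("bg", seq[0]),
             "nls": log(p_start_nls) + emit("nls", seq[0])}
    backpointers = []

    # Recursion: best log-score of any path ending in each state.
    for aa in seq[1:]:
        new_score, pointer = {}, {}
        for s in states:
            prev = max(states, key=lambda q: score[q] + trans[(q, s)])
            new_score[s] = score[prev] + trans[(prev, s)] + emit(s, aa)
            pointer[s] = prev
        score = new_score
        backpointers.append(pointer)

    # Traceback from the best final state.
    state = max(states, key=lambda s: score[s])
    path = [state]
    for pointer in reversed(backpointers):
        state = pointer[state]
        path.append(state)
    path.reverse()
    return "".join("N" if s == "nls" else "-" for s in path)


if __name__ == "__main__":
    print(viterbi_nls("AAAAAAAAAA" + "KKKRKKRK" + "AAAAAAAAAA"))
```

Working in log space avoids numerical underflow on long proteins. In a real predictor the transition and emission probabilities would be estimated from characterized NLSs rather than chosen by hand.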

    Population Structure and Genetic Diversity in a Rice Core Collection (Oryza sativa L.) Investigated with SSR Markers

    Assessing the genetic diversity and population structure of a core collection facilitates the use of its germplasm and its application in association mapping. The objectives of this study were to (1) examine the population structure of a rice core collection; (2) investigate the genetic diversity within and among subgroups of the rice core collection; and (3) identify the extent of linkage disequilibrium (LD) in the rice core collection. A rice core collection of 150 varieties, established from 2260 varieties of Ting's collection of rice germplasm, was genotyped with 274 SSR markers and used in this study. Two distinct subgroups (i.e. SG 1 and SG 2) were detected within the entire population by different statistical methods, which is in accordance with the differentiation of indica and japonica rice. MCLUST analysis might be an alternative method to STRUCTURE for population structure analysis. A subset of 26% of the total markers could detect the population structure with precision similar to that of the whole SSR marker set. Gene diversity and MRD between the two subspecies varied considerably across the genome, which might be used to identify candidate genes for the traits under domestication and artificial selection of indica and japonica rice. The percentage of SSR loci pairs in significant (P<0.05) LD is 46.8% in the entire population, and the ratio of linked to unlinked loci pairs in LD is 1.06. Across the entire population as well as the subgroups and sub-subgroups, LD decays with genetic distance, indicating that linkage is one main cause of LD. The results of this study provide valuable information for future association mapping using the rice core collection.

    An extensive (co-)expression analysis tool for the cytochrome P450 superfamily in Arabidopsis thaliana

    Background: Sequencing of the first plant genomes has revealed that cytochromes P450 have evolved to become the largest family of enzymes in secondary metabolism. The proportion of P450 enzymes with characterized biochemical function(s) is however very small. If P450 diversification mirrors evolution of chemical diversity, this points to an unexpectedly poor understanding of plant metabolism. We assumed that extensive analysis of gene expression might guide towards the function of P450 enzymes, and highlight overlooked aspects of plant metabolism.
    Results: We have created a comprehensive database, 'CYPedia', describing P450 gene expression in four data sets: organs and tissues, stress response, hormone response, and mutants of Arabidopsis thaliana, based on public Affymetrix ATH1 microarray expression data. P450 expression was then combined with the expression of 4,130 re-annotated genes, predicted to act in plant metabolism, for co-expression analyses. Based on the annotation of co-expressed genes from diverse pathway annotation databases, co-expressed pathways were identified. Predictions were validated for most P450s with known functions. As examples, co-expression results for P450s related to plastidial functions/photosynthesis, and to phenylpropanoid, triterpenoid and jasmonate metabolism are highlighted here.
    Conclusion: The large scale hypothesis generation tools presented here provide leads to new pathways, unexpected functions, and regulatory networks for many P450s in plant metabolism. These can now be exploited by the community to validate the proposed functions experimentally using reverse genetics, biochemistry, and metabolic profiling.
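The abstract above does not specify which similarity measure underlies the co-expression analysis. As a hedged illustration, the sketch below ranks genes by the Pearson correlation of their expression profiles with a target gene. The choice of Pearson correlation, the threshold, and the gene names are assumptions for this example and are not taken from CYPedia.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat profile carries no co-expression signal
    return cov / sqrt(var_x * var_y)


def coexpressed(target, profiles, threshold=0.8):
    """Return (gene, r) pairs with r >= threshold, strongest correlation first.

    profiles: dict mapping gene name -> expression values across the same
              ordered set of samples (e.g. microarray experiments).
    """
    hits = []
    for gene, profile in profiles.items():
        if gene == target:
            continue
        r = pearson(profiles[target], profile)
        if r >= threshold:
            hits.append((gene, r))
    return sorted(hits, key=lambda pair: -pair[1])


if __name__ == "__main__":
    # Hypothetical gene names and expression values.
    profiles = {
        "CYP_X": [1.0, 2.0, 3.0, 4.0],
        "geneA": [2.0, 4.0, 6.0, 8.0],
        "geneB": [4.0, 3.0, 2.0, 1.0],
    }
    print(coexpressed("CYP_X", profiles))
```

The pathway annotations of the top-ranked genes could then be tallied to suggest a candidate pathway for the target P450, in the spirit of the approach the abstract describes.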

    Complexity and specificity of the maize (Zea mays L.) root hair transcriptome

    Root hairs are tubular extensions of epidermis cells. Transcriptome profiling demonstrated that the single cell-type root hair transcriptome was less complex than the transcriptome of multiple cell-type primary roots without root hairs. In total, 831 genes were exclusively and 5585 genes were preferentially expressed in root hairs [false discovery rate (FDR) ≤1%]. Among those, the most significantly enriched Gene Ontology (GO) functional terms were related to energy metabolism, highlighting the high energy demand for the development and function of root hairs. Subsequently, the maize homologs for 138 Arabidopsis genes known to be involved in root hair development were identified and their phylogenetic relationship and expression in root hairs were determined. This study indicated that the genetic regulation of root hair development in Arabidopsis and maize is controlled by common genes, but also shows differences which need to be dissected in future genetic experiments. Finally, a maize root view of the eFP browser was implemented including the root hair transcriptome of the present study and several previously published maize root transcriptome data sets. The eFP browser provides color-coded expression levels for these root types and tissues for any gene of interest, thus providing a novel resource to study gene expression and function in maize roots