70 research outputs found

    A proteogenomic analysis of Shigella flexneri using 2D LC-MALDI TOF/TOF

    Get PDF
    <p>Abstract</p> <p>Background</p> <p>New strategies for high-throughput sequencing are constantly appearing, leading to a great increase in the number of completely sequenced genomes. Unfortunately, computational genome annotation is out of step with this progress. Thus, the accurate annotation of these genomes has become a bottleneck of knowledge acquisition.</p> <p>Results</p> <p>We exploited a proteogenomic approach to improve conventional genome annotation by integrating proteomic data with genomic information. Using <it>Shigella flexneri </it>2a as a model, we identified total 823 proteins, including 187 hypothetical proteins. Among them, three annotated ORFs were extended upstream through comprehensive analysis against an in-house N-terminal extension database. Two genes, which could not be translated to their full length because of stop codon 'mutations' induced by genome sequencing errors, were revised and annotated as fully functional genes. Above all, seven new ORFs were discovered, which were not predicted in <it>S. flexneri </it>2a str.301 by any other annotation approaches. The transcripts of four novel ORFs were confirmed by RT-PCR assay. Additionally, most of these novel ORFs were overlapping genes, some even nested within the coding region of other known genes.</p> <p>Conclusions</p> <p>Our findings demonstrate that current <it>Shigella </it>genome annotation methods are not perfect and need to be improved. Apart from the validation of predicted genes at the protein level, the additional features of proteogenomic tools include revision of annotation errors and discovery of novel ORFs. The complementary dataset could provide more targets for those interested in <it>Shigella </it>to perform functional studies.</p

    Synchronization of cytoplasmic and transferred mitochondrial ribosomal protein gene expression in land plants is linked to Telo-box motif enrichment

    Get PDF
    <p>Abstract</p> <p>Background</p> <p>Chloroplasts and mitochondria evolved from the endosymbionts of once free-living eubacteria, and they transferred most of their genes to the host nuclear genome during evolution. The mechanisms used by plants to coordinate the expression of such transferred genes, as well as other genes in the host nuclear genome, are still poorly understood.</p> <p>Results</p> <p>In this paper, we use nuclear-encoded chloroplast (cpRPGs), as well as mitochondrial (mtRPGs) and cytoplasmic (euRPGs) ribosomal protein genes to study the coordination of gene expression between organelles and the host. Results show that the mtRPGs, but not the cpRPGs, exhibit strongly synchronized expression with euRPGs in all investigated land plants and that this phenomenon is linked to the presence of a <it>telo</it>-box DNA motif in the promoter regions of mtRPGs and euRPGs. This motif is also enriched in the promoter regions of genes involved in DNA replication. Sequence analysis further indicates that mtRPGs, in contrast to cpRPGs, acquired <it>telo</it>-box from the host nuclear genome.</p> <p>Conclusions</p> <p>Based on our results, we propose a model of plant nuclear genome evolution where coordination of activities in mitochondria and chloroplast and other cellular functions, including cell cycle, might have served as a strong selection pressure for the differential acquisition of <it>telo</it>-box between mtRPGs and cpRPGs. This research also highlights the significance of physiological needs in shaping transcriptional regulatory evolution.</p

    Unpredictability of metabolism—the key role of metabolomics science in combination with next-generation genome sequencing

    Get PDF
    Next-generation sequencing provides technologies which sequence whole prokaryotic and eukaryotic genomes in days, perform genome-wide association studies, chromatin immunoprecipitation followed by sequencing and RNA sequencing for transcriptome studies. An exponentially growing volume of sequence data can be anticipated, yet functional interpretation does not keep pace with the amount of data produced. In principle, these data contain all the secrets of living systems, the genotype–phenotype relationship. Firstly, it is possible to derive the structure and connectivity of the metabolic network from the genotype of an organism in the form of the stoichiometric matrix N. This is, however, static information. Strategies for genome-scale measurement, modelling and predicting of dynamic metabolic networks need to be applied. Consequently, metabolomics science—the quantitative measurement of metabolism in conjunction with metabolic modelling—is a key discipline for the functional interpretation of whole genomes and especially for testing the numerical predictions of metabolism based on genome-scale metabolic network models. In this context, a systematic equation is derived based on metabolomics covariance data and the genome-scale stoichiometric matrix which describes the genotype–phenotype relationship

    TESTLoc: protein subcellular localization prediction from EST data

    Get PDF
    Abstract Background The eukaryotic cell has an intricate architecture with compartments and substructures dedicated to particular biological processes. Knowing the subcellular location of proteins not only indicates how bio-processes are organized in different cellular compartments, but also contributes to unravelling the function of individual proteins. Computational localization prediction is possible based on sequence information alone, and has been successfully applied to proteins from virtually all subcellular compartments and all domains of life. However, we realized that current prediction tools do not perform well on partial protein sequences such as those inferred from Expressed Sequence Tag (EST) data, limiting the exploitation of the large and taxonomically most comprehensive body of sequence information from eukaryotes. Results We developed a new predictor, TESTLoc, suited for subcellular localization prediction of proteins based on their partial sequence conceptually translated from ESTs (EST-peptides). Support Vector Machine (SVM) is used as computational method and EST-peptides are represented by different features such as amino acid composition and physicochemical properties. When TESTLoc was applied to the most challenging test case (plant data), it yielded high accuracy (~85%). Conclusions TESTLoc is a localization prediction tool tailored for EST data. It provides a variety of models for the users to choose from, and is available for download at http://megasun.bch.umontreal.ca/~shenyq/TESTLoc/TESTLoc.html</p

    Addressing statistical biases in nucleotide-derived protein databases for proteogenomic search strategies

    Get PDF
    [Image: see text] Proteogenomics has the potential to advance genome annotation through high quality peptide identifications derived from mass spectrometry experiments, which demonstrate a given gene or isoform is expressed and translated at the protein level. This can advance our understanding of genome function, discovering novel genes and gene structure that have not yet been identified or validated. Because of the high-throughput shotgun nature of most proteomics experiments, it is essential to carefully control for false positives and prevent any potential misannotation. A number of statistical procedures to deal with this are in wide use in proteomics, calculating false discovery rate (FDR) and posterior error probability (PEP) values for groups and individual peptide spectrum matches (PSMs). These methods control for multiple testing and exploit decoy databases to estimate statistical significance. Here, we show that database choice has a major effect on these confidence estimates leading to significant differences in the number of PSMs reported. We note that standard target:decoy approaches using six-frame translations of nucleotide sequences, such as assembled transcriptome data, apparently underestimate the confidence assigned to the PSMs. The source of this error stems from the inflated and unusual nature of the six-frame database, where for every target sequence there exists five “incorrect” targets that are unlikely to code for protein. The attendant FDR and PEP estimates lead to fewer accepted PSMs at fixed thresholds, and we show that this effect is a product of the database and statistical modeling and not the search engine. A variety of approaches to limit database size and remove noncoding target sequences are examined and discussed in terms of the altered statistical estimates generated and PSMs reported. These results are of importance to groups carrying out proteogenomics, aiming to maximize the validation and discovery of gene structure in sequenced genomes, while still controlling for false positives

    Targeted reprogramming of H3K27me3 resets epigenetic memory in plant paternal chromatin

    Get PDF
    Epigenetic marks are reprogrammed in the gametes to reset genomic potential in the next generation. In mammals, paternal chromatin is extensively reprogrammed through the global erasure of DNA methylation and the exchange of histones with protamines(1,2). Precisely how the paternal epigenome is reprogrammed in flowering plants has remained unclear since DNA is not demethylated and histones are retained in sperm(3,4). Here, we describe a multi-layered mechanism by which H3K27me3 is globally lost from histone-based sperm chromatin in Arabidopsis. This mechanism involves the silencing of H3K27me3 writers, activity of H3K27me3 erasers and deposition of a sperm-specific histone, H3.10 (ref. (5)), which we show is immune to lysine 27 methylation. The loss of H3K27me3 facilitates the transcription of genes essential for spermatogenesis and pre-configures sperm with a chromatin state that forecasts gene expression in the next generation. Thus, plants have evolved a specific mechanism to simultaneously differentiate male gametes and reprogram the paternal epigenome

    Investigating the validity of current network analysis on static conglomerate networks by protein network stratification

    Get PDF
    <p>Abstract</p> <p>Background</p> <p>A molecular network perspective forms the foundation of systems biology. A common practice in analyzing protein-protein interaction (PPI) networks is to perform network analysis on a conglomerate network that is an assembly of all available binary interactions in a given organism from diverse data sources. Recent studies on network dynamics suggested that this approach might have ignored the dynamic nature of context-dependent molecular systems.</p> <p>Results</p> <p>In this study, we employed a network stratification strategy to investigate the validity of the current network analysis on conglomerate PPI networks. Using the genome-scale tissue- and condition-specific proteomics data in <it>Arabidopsis thaliana</it>, we present here the first systematic investigation into this question. We stratified a conglomerate <it>A. thaliana </it>PPI network into three levels of context-dependent subnetworks. We then focused on three types of most commonly conducted network analyses, i.e., topological, functional and modular analyses, and compared the results from these network analyses on the conglomerate network and five stratified context-dependent subnetworks corresponding to specific tissues.</p> <p>Conclusions</p> <p>We found that the results based on the conglomerate PPI network are often significantly different from those of context-dependent subnetworks corresponding to specific tissues or conditions. This conclusion depends neither on relatively arbitrary cutoffs (such as those defining network hubs or bottlenecks), nor on specific network clustering algorithms for module extraction, nor on the possible high false positive rates of binary interactions in PPI networks. We also found that our conclusions are likely to be valid in human PPI networks. Furthermore, network stratification may help resolve many controversies in current research of systems biology.</p
    corecore