42 research outputs found

    AgBase: supporting functional modeling in agricultural organisms

    AgBase (http://www.agbase.msstate.edu/) provides resources to facilitate modeling of functional genomics data and structural and functional annotation of agriculturally important animal, plant, microbe and parasite genomes. The website is redesigned to improve accessibility and ease of use, including improved search capabilities. Expanded capabilities include new dedicated pages for horse, cat, dog, cotton, rice and soybean. We currently provide 590 240 Gene Ontology (GO) annotations to 105 454 gene products in 64 different species, including GO annotations linked to transcripts represented on agricultural microarrays. For many of these arrays, this provides the only functional annotation available. GO annotations are available for download and we provide comprehensive, species-specific GO annotation files for 18 different organisms. The tools available at AgBase have been expanded and several existing tools improved based upon user feedback. One of seven new tools available at AgBase, GOModeler, supports hypothesis testing from functional genomics data. We host several associated databases and provide genome browsers for three agricultural pathogens. Moreover, we provide comprehensive training resources (including worked examples and tutorials) via links to Educational Resources at the AgBase website

    Experimental annotation of post-translational features and translated coding regions in the pathogen Salmonella Typhimurium

    Background: Complete and accurate genome annotation is crucial for comprehensive and systematic studies of biological systems. However, determining protein-coding genes for most new genomes is almost completely performed by inference using computational predictions with significant documented error rates (>15%). Furthermore, gene prediction programs provide no information on biologically important post-translational processing events critical for protein function. Results: We experimentally annotated the bacterial pathogen Salmonella Typhimurium 14028, using "shotgun" proteomics to accurately uncover the translational landscape and post-translational features. The data provide protein-level experimental validation for approximately half of the predicted protein-coding genes in Salmonella and suggest revisions to several genes that appear to have incorrectly assigned translational start sites, including a potential novel alternate start codon. Additionally, we uncovered 12 non-annotated genes missed by gene prediction programs, as well as evidence suggesting a role for one of these novel ORFs in Salmonella pathogenesis. We also characterized post-translational features in the Salmonella genome, including chemical modifications and proteolytic cleavages. We find that bacteria have a much larger and more complex repertoire of chemical modifications than previously thought, including several novel modifications. Our in vivo proteolysis data identified more than 130 signal peptide and N-terminal methionine cleavage events critical for protein function. Conclusion: This work highlights several ways in which application of proteomics data can improve the quality of genome annotations to facilitate novel biological insights and provides a comprehensive proteome map of Salmonella as a resource for systems analysis.

    Improved Algorithms for Discovery of New Genes in Bacterial Genomes

    In this dissertation, we describe a new approach for gene finding that can utilize proteomics information in addition to DNA and RNA to identify new genes in prokaryote genomes. Proteomics processing pipelines require identification of small pieces of proteins called peptides. Peptide identification is a very error-prone process and we have developed a new algorithm for validating peptide identifications using a distance-based outlier detection method. We demonstrate that our method identifies more peptides than other popular methods using standard mixtures of known proteins. In addition, our algorithm provides a much more accurate estimate of the false discovery rate than other methods. Once peptides have been identified and validated, we use a second algorithm, proteogenomic mapping (PGM), to map these peptides to the genome to find the genetic signals that allow us to identify potential novel protein coding genes called expressed Protein Sequence Tags (ePSTs). We then collect and combine evidence for the ePSTs we generated, and evaluate the likelihood that each ePST represents a true new protein coding gene using supervised machine learning techniques. Finally, we have developed new approaches to Bayesian learning that allow us to model the knowledge domain from sparse biological datasets. We have developed two new bootstrap approaches that utilize resampling to build networks with the most robust features that reoccur in many networks. These bootstrap methods yield improved prediction accuracy. We have also developed an unsupervised Bayesian network structure learning method that can be used when training data is not available or when labels may not be reliable
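
The mapping step described in this abstract, locating identified peptides on the genome, can be illustrated with a minimal sketch. This is not the dissertation's PGM algorithm: it assumes exact string matching of a peptide against a six-frame translation using the standard genetic code, and the names `translate` and `map_peptide` are hypothetical helpers introduced only for illustration.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumption: no
# organism-specific translation table is applied).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate a DNA string codon by codon; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def map_peptide(genome, peptide):
    """Return (strand, frame, nucleotide_start) for every exact peptide match.

    nucleotide_start is a 0-based offset on the scanned strand, so hits on
    the '-' strand are relative to the reverse complement.
    """
    genome = genome.upper()
    hits = []
    for strand, s in (("+", genome), ("-", reverse_complement(genome))):
        for frame in range(3):
            protein = translate(s[frame:])
            pos = protein.find(peptide)
            while pos != -1:
                hits.append((strand, frame, frame + 3 * pos))
                pos = protein.find(peptide, pos + 1)
    return hits


if __name__ == "__main__":
    # Toy genome: the peptide MKG is encoded in frame 2 of the forward strand.
    print(map_peptide("CCATGAAAGGGTAA", "MKG"))
```

A hit in a region with no annotated gene would be a candidate for further evidence gathering; the validation and machine-learning scoring steps mentioned in the abstract are outside the scope of this sketch.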

    An integrated model system to gain mechanistic insights into biofilm-associated antimicrobial resistance in Pseudomonas aeruginosa MPAO1

    Pseudomonas aeruginosa MPAO1 is the parental strain of the widely utilized transposon mutant collection for this important clinical pathogen. Here, we validate a model system to identify genes involved in biofilm growth and biofilm-associated antibiotic resistance. Our model employs a genomics-driven workflow to assemble the complete MPAO1 genome, identify unique and conserved genes by comparative genomics with the PAO1 reference strain and genes missed within existing assemblies by proteogenomics. Among over 200 unique MPAO1 genes, we identified six general essential genes that were overlooked when mapping public Tn-seq data sets against PAO1, including an antitoxin. Genomic data were integrated with phenotypic data from an experimental workflow using a user-friendly, soft lithography-based microfluidic flow chamber for biofilm growth and a screen with the Tn-mutant library in microtiter plates. The screen identified hitherto unknown genes involved in biofilm growth and antibiotic resistance. Experiments conducted with the flow chamber across three laboratories delivered reproducible data on P. aeruginosa biofilms and validated the function of both known genes and genes identified in the Tn-mutant screens. Differential protein abundance data from planktonic cells versus biofilm confirmed the upregulation of candidates known to affect biofilm formation, of structural and secreted proteins of type VI secretion systems, and provided proteogenomic evidence for some missed MPAO1 genes. This integrated, broadly applicable model promises to improve the mechanistic understanding of biofilm formation, antimicrobial tolerance, and resistance evolution in biofilms

    Developing a bioinformatics framework for proteogenomics

    In the last 15 years, since the human genome was first sequenced, genome sequencing and annotation have continued to improve. However, genome annotation has not kept up with the accelerating rate of genome sequencing and as a result there is now a large backlog of genomic data waiting to be interpreted both quickly and accurately. Through advances in proteomics a new field has emerged to help improve genome annotation, termed proteogenomics, which uses peptide mass spectrometry data, enabling the discovery of novel protein coding genes, as well as the refinement and validation of known and putative protein-coding genes. The annotation of genomes relies heavily on ab initio gene prediction programs and/or mapping of a range of RNA transcripts. Although this method provides insights into the gene content of genomes it is unable to distinguish protein-coding genes from putative non-coding RNA genes. This problem is further confounded by the fact that only 5% of the public protein sequence repository at UniProt/SwissProt has been curated and derived from actual protein evidence. This thesis contends that it is critically important to incorporate proteomics data into genome annotation pipelines to provide experimental protein-coding evidence. Although there have been major improvements in proteogenomics over the last decade there are still numerous challenges to overcome. These key challenges include the loss of sensitivity when using inflated search spaces of putative sequences, how best to interpret novel identifications and how best to control for false discoveries. This thesis addresses the existing gap between the use of genomic and proteomic sources for accurate genome annotation by applying a proteogenomics approach with a customised methodology. This new approach was applied within four case studies: a prokaryote bacterium; a monocotyledonous wheat plant; a dicotyledonous grape plant; and human. 
    The key contributions of this thesis are: a new methodology for proteogenomics analysis; 145 suggested gene refinements in Bradyrhizobium diazoefficiens (nitrogen-fixing bacteria); 55 new gene predictions (57 protein isoforms) in Vitis vinifera (grape); 49 new gene predictions (52 protein isoforms) in Homo sapiens (human); and 67 new gene predictions (70 protein isoforms) in Triticum aestivum (bread wheat). Lastly, a number of possible improvements for the studies conducted in this thesis and proteogenomics as a whole have been identified and discussed

    Identification, organisation and visualisation of complete proteomes in UniProt throughout all taxonomic ranks: archaea, bacteria, eukaryote and virus

    Users of uniprot.org want to be able to query, retrieve and download proteome sets for an organism of their choice. They expect the data to be easily accessed, complete and up to date based on current available knowledge. UniProt release 2012_01 (25th Jan 2012) contains the proteomes of 2,923 organisms; 50% of which are bacteria, 38% viruses, 8% eukaryota and 4% archaea. Note that the term 'organism' is used in a broad sense to include subspecies, strains and isolates. Each completely sequenced organism is processed as an independent organism, hence the availability of 38 strain-specific Escherichia coli proteomes that are accessible for download. There is a project within UniProt dedicated to the mammoth task of maintaining the “Proteomes database”. This active resource is essential for UniProt to continually provide high quality proteome sets to the users. Accurate identification and incorporation of new, publicly available proteomes, as well as the maintenance of existing proteomes, permits sustained growth of the proteomes project. This is a huge, complicated and vital task accomplished by the activities of both curators and programmers. This thesis explains the data input and output of the proteomes database: the flow of genome project data from the nucleotide database into the proteomes database, then from each genome how a proteome is identified, augmented and made visible to uniprot.org users. Along this journey of discovery many issues arose, puzzles concerning data gathering, data integrity and also data visualisation. All were resolved and the outcome is a well-documented, actively maintained database that strives to provide optimal proteome information to its users

    Similarity-based gene detection: using COGs to find evolutionarily-conserved ORFs

    BACKGROUND: Experimental verification of gene products has not kept pace with the rapid growth of microbial sequence information. However, existing annotations of gene locations contain sufficient information to screen for probable errors. Furthermore, comparisons among genomes become more informative as more genomes are examined. We studied all open reading frames (ORFs) of at least 30 codons from the genomes of 27 sequenced bacterial strains. We grouped the potential peptide sequences encoded from the ORFs by forming Clusters of Orthologous Groups (COGs). We used this grouping in order to find homologous relationships that would not be distinguishable from noise when using simple BLAST searches. Although COG analysis was initially developed to group annotated genes, we applied it to the task of grouping anonymous DNA sequences that may encode proteins. RESULTS: "Mixed COGs" of ORFs (clusters in which some sequences correspond to annotated genes and some do not) are attractive targets when seeking errors of gene prediction. Examination of mixed COGs reveals some situations in which genes appear to have been missed in current annotations and a smaller number of regions that appear to have been annotated as gene loci erroneously. This technique can also be used to detect potential pseudogenes or sequencing errors. Our method uses an adjustable parameter for degree of conservation among the studied genomes (stringency). We detail results for one level of stringency at which we found 83 potential genes which had not previously been identified, 60 potential pseudogenes, and 7 sequences with existing gene annotations that are probably incorrect. CONCLUSION: Systematic study of sequence conservation offers a way to improve existing annotations by identifying potentially homologous regions where the annotation of the presence or absence of a gene is inconsistent among genomes
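
The first step in this abstract, collecting every ORF of at least 30 codons, can be sketched as follows. The abstract does not define an ORF precisely, so this sketch assumes one: an in-frame run from the first ATG to the next stop codon, scanned on both strands, with the stop codon excluded from the codon count. The function name `find_orfs` is hypothetical, and the COG clustering step is not covered.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=30):
    """Return (strand, frame, start, end) for each ORF of >= min_codons codons.

    Assumed ORF definition: first ATG in a frame up to (not including) the
    next in-frame stop codon. start and end are 0-based offsets on the
    scanned strand, with end exclusive.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            orf_start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if orf_start is None:
                    if codon == "ATG":
                        orf_start = i  # open a new candidate ORF
                elif codon in STOP_CODONS:
                    if (i - orf_start) // 3 >= min_codons:
                        orfs.append((strand, frame, orf_start, i))
                    orf_start = None  # close the ORF at the stop codon
    return orfs


if __name__ == "__main__":
    # ATG plus 29 alanine codons gives exactly 30 codons before the stop.
    print(find_orfs("ATG" + "GCT" * 29 + "TAA"))
```

Lowering `min_codons` admits more candidate ORFs at the cost of more spurious ones, which is why a conservation-based filter such as the clustering described above is needed afterwards.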

    Genomic and Proteomic Characterisation of the European House Dust Mite, Dermatophagoides pteronyssinus

    House dust mites are major causative agents in the pathogenesis of allergy. Their proximity to human habitats, association with the development of allergenic diseases, and resistance to physical and chemical control measures make them some of the most medically important mites. Understanding of house dust mites has been hampered by a lack of genomic sequence data and limited to a discrete number of proteins. The work presented here is a detailed characterisation of the European house dust mite, Dermatophagoides pteronyssinus airmid strain, at the genomic and proteomic level. Genomic sequencing and assembly resulted in a high-quality assembly of 70.76 Mb in size with 96.86% coverage. A comprehensive bioinformatic and proteomic examination was conducted on the 12,530 predicted proteins, validating the expression of 4,002. A small group of D. pteronyssinus airmid proteins showed significant homology to known allergens from other species. A large scale comparative proteomic investigation of the mite body and spent growth medium allowed for: (i) qualitative assessment of allergen localisation and (ii) the identification of numerous enzymes that may be involved in key physiological activities. The characterisation of protein extracts from house dust also identified a substantial number of uncharacterised D. pteronyssinus proteins in addition to known and putative allergens. The genes encoding novel β-1,3 glucanases were identified within a trigene cluster in D. pteronyssinus airmid. Recombinant protein expression, biochemical and proteomic analysis revealed Glu1 and Glu2 to exhibit hydrolytic activity toward β-1,3 glucans and have increased expression in the mite body and excretome of D. pteronyssinus in response to yeast diet. Further proteomic and enzymatic analysis correlated glucanase activity in house dust with presence of Glu1 and Glu2. These findings provide evidence that active β-1,3 glucanases are expressed and excreted in the faeces of D. pteronyssinus in response to fungal diet, in both the laboratory and the wild-type environment