147 research outputs found
Assessing the precision of high-throughput computational and laboratory approaches for the genome-wide identification of protein subcellular localization in bacteria
BACKGROUND: Identification of a bacterial protein's subcellular localization (SCL) is important for genome annotation, function prediction and drug or vaccine target identification. Subcellular fractionation techniques combined with recent proteomics technology permit the identification of large numbers of proteins from distinct bacterial compartments. However, the fractionation of a complex structure like the cell into several subcellular compartments is not a trivial task. Contamination from other compartments may occur, and some proteins may reside in multiple localizations. New computational methods have been reported over the past few years that now permit much more accurate, genome-wide analysis of the SCL of protein sequences deduced from genomes. There is a need to compare such computational methods with laboratory proteomics approaches to identify the most effective current approach for genome-wide localization characterization and annotation. RESULTS: In this study, ten subcellular proteome analyses of bacterial compartments were reviewed. PSORTb version 2.0 was used to computationally predict the localization of proteins reported in these publications, and these computational predictions were then compared to the localizations determined by the proteomics study. By using a combined approach, we were able to identify a number of contaminants and proteins with dual localizations, and were able to more accurately identify membrane subproteomes. Our results allowed us to estimate the precision level of laboratory subproteome studies and we show here that, on average, recent high-precision computational methods such as PSORTb now have a lower error rate than laboratory methods.
CONCLUSION: We have performed the first focused comparison of genome-wide proteomic and computational methods for subcellular localization identification, and show that computational methods have now attained a level of precision exceeding that of high-throughput laboratory approaches. We note that analysis of all cellular fractions collectively is required to effectively provide localization information from laboratory studies, and we propose an overall approach to genome-wide subcellular localization characterization that capitalizes on the complementary nature of current laboratory and computational methods.
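The comparison described in this abstract, checking the fraction a protein was reported in against its predicted localization, can be illustrated with a minimal sketch. This is not the authors' pipeline: the function name, the protein IDs, the localization labels and the use of an empty set for "no confident prediction" are all assumptions made for illustration.

```python
def compare_fraction(reported_fraction, predictions):
    """Split proteins from one laboratory fraction by agreement with predictions.

    predictions maps a protein ID to a set of predicted localizations,
    or to an empty set when the predictor made no confident call.
    All IDs and labels here are hypothetical.
    """
    agree, disagree, unknown = [], [], []
    for protein, localizations in sorted(predictions.items()):
        if not localizations:
            unknown.append(protein)       # no prediction to compare against
        elif reported_fraction in localizations:
            agree.append(protein)         # includes multi-localization proteins
        else:
            disagree.append(protein)      # candidate contaminant, needs review
    called = len(agree) + len(disagree)
    agreement = len(agree) / called if called else None
    return agree, disagree, unknown, agreement


# Hypothetical proteins reported in an outer membrane fraction.
example = {
    "protA": {"OuterMembrane"},
    "protB": {"Cytoplasmic"},
    "protC": {"OuterMembrane", "Extracellular"},
    "protD": set(),
}
agree, disagree, unknown, agreement = compare_fraction("OuterMembrane", example)
```

Proteins in the `disagree` list are only candidates for contamination; a disagreement can equally reflect a prediction error or a genuine dual localization, which is why the abstract argues for combining both kinds of evidence.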
Evaluation of genomic island predictors using a comparative genomics approach
Background: Genomic islands (GIs) are clusters of genes in prokaryotic genomes of probable horizontal origin. GIs are disproportionately associated with microbial adaptations of medical or environmental interest. Recently, multiple programs for automated detection of GIs have been developed that utilize sequence composition characteristics, such as G+C ratio and dinucleotide bias. To robustly evaluate the accuracy of such methods, we propose that a dataset of GIs be constructed using criteria that are independent of sequence composition-based analysis approaches. Results: We developed a comparative genomics approach (IslandPick) that identifies both very probable islands and non-island regions. The approach involves 1) flexible, automated selection of comparative genomes for each query genome, using a distance function that picks appropriate genomes for identification of GIs, 2) identification of regions unique to the query genome, compared with the chosen genomes (positive dataset) and 3) identification of regions conserved across all genomes (negative dataset). Using our constructed datasets, we investigated the accuracy of several sequence composition-based GI prediction tools. Conclusion: Our results indicate that AlienHunter has the highest recall, but the lowest measured precision, while SIGI-HMM is the most precise method. SIGI-HMM and IslandPath/DIMOB have comparable overall highest accuracy. Our comparative genomics approach, IslandPick, was the most accurate, compared with a curated list of GIs, indicating that we have constructed suitable datasets. This represents the first evaluation, using diverse and independent datasets that were not artificially constructed, of the accuracy of several sequence composition-based GI predictors. The caveats associated with this analysis and proposals for optimal island prediction are discussed.
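As a rough illustration of how a predictor might be scored against positive (island) and negative (non-island) datasets, the sketch below counts positions per nucleotide. The coordinates are invented, the function names are hypothetical, and ignoring predictions that fall outside both datasets is an assumption of this sketch rather than a detail taken from the abstract.

```python
def covered_positions(regions):
    """Expand inclusive (start, end) intervals into a set of positions."""
    positions = set()
    for start, end in regions:
        positions.update(range(start, end + 1))
    return positions


def evaluate_predictor(predicted, positive, negative):
    """Score predicted islands per nucleotide against the two datasets.

    Positions outside both datasets are ignored (an assumption of this sketch).
    """
    pred = covered_positions(predicted)
    pos = covered_positions(positive)
    neg = covered_positions(negative)
    tp = len(pred & pos)   # predicted, and in a known island
    fp = len(pred & neg)   # predicted, but in a conserved region
    fn = len(pos - pred)   # known island positions that were missed
    tn = len(neg - pred)   # conserved positions correctly left alone
    total = tp + fp + fn + tn
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "accuracy": (tp + tn) / total if total else 0.0,
    }


# Invented coordinates: half of each prediction set is right, half is wrong.
scores = evaluate_predictor(
    predicted=[(1, 10), (21, 30)],
    positive=[(1, 20)],
    negative=[(21, 40)],
)
```

Expanding intervals into sets keeps the sketch short; for real genome-scale coordinates an interval-overlap computation would be more memory-efficient.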
Identification of the Regulatory Logic Controlling Salmonella Pathoadaptation by the SsrA-SsrB Two-Component System
Sequence data from the past decade has laid bare the significance of horizontal gene transfer in creating genetic diversity in the bacterial world. Regulatory evolution, in which non-coding DNA is mutated to create new regulatory nodes, also contributes to this diversity to allow niche adaptation and the evolution of pathogenesis. To survive in the host environment, Salmonella enterica uses a type III secretion system and effector proteins, which are activated by the SsrA-SsrB two-component system in response to the host environment. To better understand the phenomenon of regulatory evolution in S. enterica, we defined the SsrB regulon and asked how this transcription factor interacts with the cis-regulatory region of target genes. Using ChIP-on-chip, cDNA hybridization, and comparative genomics analyses, we describe the SsrB-dependent regulon of ancestral and horizontally acquired genes. Further, we used a genetic screen and computational analyses integrating experimental data from S. enterica and sequence data from an orthologous regulatory system in the insect endosymbiont, Sodalis glossinidius, to identify the conserved yet flexible palindrome sequence that defines DNA recognition by SsrB. Mutational analysis of a representative promoter validated this palindrome as the minimal architecture needed for regulatory input by SsrB. These data provide a high-resolution map of a regulatory network and the underlying logic enabling pathogen adaptation to a host
Pathway-GPS and SIGORA: identifying relevant pathways based on the over-representation of their gene-pair signatures
Motivation. Predominant pathway analysis approaches treat pathways as collections of individual genes and consider all pathway members as equally informative. As a result, at times spurious and misleading pathways are inappropriately identified as statistically significant, solely due to components that they share with the more relevant pathways.
Results. We introduce the concept of Pathway Gene-Pair Signatures (Pathway-GPS) as pairs of genes that, as a combination, are specific to a single pathway. We devised and implemented a novel approach to pathway analysis, Signature Over-representation Analysis (SIGORA), which focuses on the statistically significant enrichment of Pathway-GPS in a user-specified gene list of interest. In a comparative evaluation of several published datasets, SIGORA outperformed traditional methods by delivering biologically more plausible and relevant results.
Availability. An efficient implementation of SIGORA, as an R package with precompiled GPS data for several human and mouse pathway repositories is available for download from http://sigora.googlecode.com/svn/
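The idea of gene-pair signatures can be sketched in a few lines: collect gene pairs that occur in exactly one pathway, then ask whether a query gene list contains more of a pathway's signatures than expected. This is a simplified illustration and not the SIGORA R package; the function names, the toy pathways and the plain hypergeometric test over the observed signatures are assumptions of the sketch.

```python
from itertools import combinations
from math import comb


def gene_pair_signatures(pathways):
    """Map each gene pair that occurs in exactly one pathway to that pathway."""
    owners = {}
    for name, genes in pathways.items():
        for pair in combinations(sorted(genes), 2):
            owners.setdefault(pair, set()).add(name)
    return {pair: next(iter(names))
            for pair, names in owners.items() if len(names) == 1}


def hypergeom_upper_tail(k, total, successes, draws):
    """P(X >= k) when drawing `draws` items from `total`, `successes` of them marked."""
    upper = min(successes, draws)
    hits = sum(comb(successes, i) * comb(total - successes, draws - i)
               for i in range(k, upper + 1))
    return hits / comb(total, draws)


def signature_enrichment(pathways, query_genes):
    """Return an upper-tail p-value for each pathway with a signature in the query."""
    signatures = gene_pair_signatures(pathways)
    query = set(query_genes)
    present = [pair for pair in signatures
               if pair[0] in query and pair[1] in query]
    results = {}
    for name in pathways:
        n_pathway = sum(1 for owner in signatures.values() if owner == name)
        k_observed = sum(1 for pair in present if signatures[pair] == name)
        if k_observed:
            results[name] = hypergeom_upper_tail(
                k_observed, len(signatures), n_pathway, len(present))
    return results


# Toy pathways: the shared pair (g2, g3) is not a signature of either one.
toy_pathways = {"A": {"g1", "g2", "g3"}, "B": {"g2", "g3", "g4"}}
toy_signatures = gene_pair_signatures(toy_pathways)
toy_results = signature_enrichment(toy_pathways, ["g1", "g2", "g3"])
```

In the toy example the query contains g2 and g3, which both pathways share, yet only pathway A is reported, because the shared pair carries no weight. That is the behaviour the abstract contrasts with gene-by-gene over-representation.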
InnateDB: systems biology of innate immunity and beyond—recent updates and continuing curation
InnateDB (http://www.innatedb.com) is an integrated analysis platform that has been specifically designed to facilitate systems-level analyses of mammalian innate immunity networks, pathways and genes. In this article, we provide details of recent updates and improvements to the database. InnateDB now contains >196 000 human, mouse and bovine experimentally validated molecular interactions and 3000 pathway annotations of relevance to all mammalian cellular systems (i.e. not just immune-relevant pathways and interactions). In addition, the InnateDB team has, to date, manually curated in excess of 18 000 molecular interactions of relevance to innate immunity, providing unprecedented insight into innate immunity networks, pathways and their component molecules. More recently, InnateDB has also initiated the curation of allergy- and asthma-related interactions. Furthermore, we report a range of improvements to our integrated bioinformatics solutions including web service access to InnateDB interaction data using Proteomics Standards Initiative Common Query Interface, enhanced Gene Ontology analysis for innate immunity, and the availability of new network visualization tools. Finally, the recent integration of bovine data makes InnateDB the first integrated network analysis platform for this agriculturally important model organism.
This work was supported by Genome BC through the Pathogenomics of Innate Immunity (PI2) project and by the Foundation for the National Institutes of Health and the Canadian Institutes of Health Research under the Grand Challenges in Global Health Research Initiative [Grand Challenges ID: 419]. Further funding was also provided by AllerGen grants 12ASI1 and 12B&B2. D.J.L. was funded in part during this project by a postdoctoral trainee award from the Michael Smith Foundation for Health Research (MSFHR). F.S.L.B. is an MSFHR Senior Scholar and R.E.W.H. holds a Canada Research Chair (CRC). Funding to enable bovine systems biology in InnateDB is provided by Teagasc [RMIS6018] and the Teagasc Walsh Fellowship scheme. IMEx is funded by the European Commission under the PSIMEx project [contract number FP7-HEALTH-2007-223411]. Funding for open access charge: Teagasc [RMIS6018].
PSORTdb: a protein subcellular localization database for bacteria
Information about bacterial subcellular localization (SCL) is important for protein function prediction and identification of suitable drug/vaccine/diagnostic targets. PSORTdb (http://db.psort.org/) is a web-accessible database of SCL for bacteria that contains both information determined through laboratory experimentation and computational predictions. The dataset of experimentally verified information (∼2000 proteins) was manually curated by us and represents the largest dataset of its kind. Earlier versions have been used for training SCL predictors, and its incorporation now into this new PSORTdb resource, with its associated additional annotation information and dataset version control, should aid researchers in future development of improved SCL predictors. The second component of this database contains computational analyses of proteins deduced from the most recent NCBI dataset of completely sequenced genomes. Analyses are currently calculated using PSORTb, the most precise automated SCL predictor for bacterial proteins. Both datasets can be accessed through the web using a very flexible text search engine, a data browser, or using BLAST, and the entire database or search results may be downloaded in various formats. Features such as GO ontologies and multiple accession numbers are incorporated to facilitate integration with other bioinformatics resources. PSORTdb is freely available under GNU General Public License
The Burkholderia Genome Database: facilitating flexible queries and comparative analyses
Summary: As the genome sequences of multiple strains of a given bacterial species are obtained, more generalized bacterial genome databases may be complemented by databases that are focused on providing more information geared for a distinct bacterial phylogenetic group and its associated research community. The Burkholderia Genome Database represents a model for such a database, providing a powerful, user-friendly search and comparative analysis interface that contains features not found in other genome databases. It contains continually updated, curated and tracked information about Burkholderia cepacia complex genome annotations, plus other Burkholderia species genomes for comparison, providing a high-quality resource for its targeted cystic fibrosis research community
Evidence of a Large Novel Gene Pool Associated with Prokaryotic Genomic Islands
Microbial genes that are “novel” (no detectable homologs in other species) have become of increasing interest as environmental sampling suggests that there are many more such novel genes in yet-to-be-cultured microorganisms. By analyzing known microbial genomic islands and prophages, we developed criteria for systematic identification of putative genomic islands (clusters of genes of probable horizontal origin in a prokaryotic genome) in 63 prokaryotic genomes, and then characterized the distribution of novel genes and other features. All but a few of the genomes examined contained significantly higher proportions of novel genes in their predicted genomic islands compared with the rest of their genome (paired t-test, p = 4.43E-14 to 1.27E-18, depending on method). Moreover, the reverse observation (i.e., higher proportions of novel genes outside of islands) never reached statistical significance in any organism examined. We show that this higher proportion of novel genes in predicted genomic islands is not due to less accurate gene prediction in genomic island regions, but likely reflects a genuine increase in novel genes in these regions for both bacteria and archaea. This represents the first comprehensive analysis of novel genes in prokaryotic genomic islands and provides clues regarding the origin of novel genes. Our collective results imply that there are different gene pools associated with recently horizontally transmitted genomic regions versus regions that are primarily vertically inherited. Moreover, there are more novel genes within the gene pool associated with genomic islands. Since genomic islands are frequently associated with a particular microbial adaptation, such as antibiotic resistance, pathogen virulence, or metal resistance, this suggests that microbes may have access to a larger “arsenal” of novel genes for adaptation than previously thought.
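The paired comparison mentioned in this abstract (proportion of novel genes inside versus outside predicted islands, one pair per genome) can be illustrated with a small sketch of the paired t statistic. The proportions below are invented and the function name is hypothetical; converting the statistic to a p-value needs a t-distribution with n - 1 degrees of freedom, which the Python standard library does not provide.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(first, second):
    """Return the paired t statistic for two equal-length samples."""
    if len(first) != len(second) or len(first) < 2:
        raise ValueError("need two samples of equal length with at least 2 pairs")
    diffs = [a - b for a, b in zip(first, second)]
    spread = stdev(diffs)  # sample standard deviation of the differences
    if spread == 0:
        raise ValueError("all differences are identical; t is undefined")
    return mean(diffs) / (spread / sqrt(len(diffs)))


# Invented per-genome proportions of novel genes (not data from the study).
in_islands = [0.30, 0.25, 0.40, 0.35]
outside_islands = [0.20, 0.15, 0.25, 0.30]
t_value = paired_t_statistic(in_islands, outside_islands)
```

Pairing by genome matters here: each genome serves as its own control, so differences in overall novel-gene content between organisms do not inflate the variance.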
The Association of Virulence Factors with Genomic Islands
Background: It has been noted that many bacterial virulence factor genes are located within genomic islands (GIs; clusters of genes in a prokaryotic genome of probable horizontal origin). However, such studies have been limited to single genera or isolated observations. We have performed the first large-scale analysis of multiple diverse pathogens to examine this association. We additionally identified genes found predominantly in pathogens, but not non-pathogens, across multiple genera using 631 complete bacterial genomes, and we identified common trends in virulence for genes in GIs. Furthermore, we examined the relationship between GIs and clustered regularly interspaced palindromic repeats (CRISPRs) proposed to confer resistance to phage. Methodology/Principal Findings: We show quantitatively that GIs disproportionately contain more virulence factors than the rest of a given genome (p < 1E-40 using three GI datasets) and that CRISPRs are also over-represented in GIs. Virulence factors in GIs and pathogen-associated virulence factors are enriched for proteins having more “offensive” functions, e.g. active invasion of the host, and are disproportionately components of type III/IV secretion systems or toxins. Numerous hypothetical pathogen-associated genes were identified, meriting further study. Conclusions/Significance: This is the first systematic analysis across diverse genera indicating that virulence factors are disproportionately associated with GIs. “Offensive” virulence factors, as opposed to host-interaction factors, may more ofte
Testing and healthcare seeking behavior preceding HIV diagnosis among migrant and non-migrant individuals living in the Netherlands: Directions for early-case finding
OBJECTIVES: To assess differences in socio-demographics, HIV testing and healthcare seeking behavior between individuals diagnosed late and those diagnosed early after HIV-acquisition. DESIGN: Cross-sectional study among recently HIV-diagnosed migrant and non-migrant individuals living in the Netherlands. METHODS: Participants self-completed a questionnaire on socio-demographics, HIV-testing and healthcare seeking behavior preceding HIV diagnosis between 2013 and 2015. Using multivariable logistic regression, socio-demographic determinants of late diagnosis were explored. Variables on HIV-infection, testing and access to care preceding HIV diagnosis were compared between those diagnosed early and those diagnosed late using descriptive statistics. RESULTS: We included 143 individuals with early and 101 with late diagnosis, of whom respectively 59/143 (41%) and 54/101 (53%) were migrants. Late diagnosis was significantly associated with older age and being heterosexual. Before HIV diagnosis, 89% of those with early and 62% of those with late diagnosis had ever been tested for HIV-infection (p<0.001), and respectively 99% and 97% reported healthcare usage in the Netherlands in the two years preceding HIV diagnosis (p = 0.79). Individuals diagnosed late most frequently visited a general practitioner (72%) or dentist (62%), and 20% had been hospitalized preceding diagnosis. In these settings, HIV testing was discussed in only 20%, 2%, and 6% of cases, respectively. CONCLUSION: A large proportion of people diagnosed late had previously tested for HIV and had high levels of healthcare usage. For earlier-case finding of HIV it therefore seems feasible to successfully roll out interventions within the existing healthcare system. Simultaneously, efforts should be made to encourage future repeated or routine HIV testing among individuals whenever they undergo an HIV test.