240 research outputs found

    Gene Function Classification Using Bayesian Models with Hierarchy-Based Priors

    We investigate the application of hierarchical classification schemes to the annotation of gene function based on several characteristics of protein sequences, including phylogenetic descriptors, sequence-based attributes, and predicted secondary structure. We discuss three Bayesian models and compare their performance in terms of predictive accuracy. These models are the ordinary multinomial logit (MNL) model, a hierarchical model based on a set of nested MNL models, and an MNL model with a prior that introduces correlations between the parameters for classes that are nearby in the hierarchy. We also provide a new scheme for combining different sources of information. We use these models to predict the functional class of Open Reading Frames (ORFs) from the E. coli genome. The results from all three models show substantial improvement over previous methods, which were based on the C5 algorithm. The MNL model using a prior based on the hierarchy outperforms both the non-hierarchical MNL model and the nested MNL model. In contrast to previous attempts at combining these sources of information, our approach results in a higher accuracy rate than models that use each data source alone. Together, these results show that gene function can be predicted with higher accuracy than previously achieved, using Bayesian models that incorporate suitable prior information.

    The effects of weather and climate change on dengue

    There is much uncertainty about the future impact of climate change on vector-borne diseases. Such uncertainty reflects the difficulties in modelling the complex interactions between disease, climatic and socioeconomic determinants. We used a comprehensive panel dataset from Mexico covering 23 years of province-specific dengue reports across nine climatic regions to estimate the impact of weather on dengue, accounting for the effects of non-climatic factors.

    Effects of temperature on the transmission of Yersinia pestis by the flea, Xenopsylla cheopis, in the late phase period

    Background: Traditionally, efficient flea-borne transmission of Yersinia pestis, the causative agent of plague, was thought to be dependent on a process referred to as blockage, in which biofilm-mediated growth of the bacteria physically blocks the flea gut, leading to the regurgitation of contaminated blood into the host. This process was previously shown to be temperature-regulated, with blockage failing at temperatures approaching 30°C; however, the abilities of fleas to transmit infections at different temperatures had not been adequately assessed. We infected colony-reared fleas of Xenopsylla cheopis with a wild type strain of Y. pestis and maintained them at 10, 23, 27, or 30°C. Naïve mice were exposed to groups of infected fleas beginning on day 7 post-infection (p.i.), and every 3-4 days thereafter until day 14 p.i. for fleas held at 10°C, or 28 days p.i. for fleas held at 23-30°C. Transmission was confirmed using Y. pestis-specific antigen or antibody detection assays on mouse tissues. Results: Although no statistically significant differences in per-flea transmission efficiencies were detected between 23 and 30°C, efficiencies were highest for fleas maintained at 23°C and they began to decline at 27 and 30°C by day 21 p.i. These declines coincided with declining median bacterial loads in fleas at 27 and 30°C. Survival and feeding rates of fleas also varied by temperature, suggesting that fleas at 27 and 30°C would be less likely to sustain transmission than fleas maintained at 23°C. Fleas held at 10°C transmitted Y. pestis infections, although flea survival was significantly reduced compared to that of uninfected fleas at this temperature. Median bacterial loads were significantly higher at 10°C than at the other temperatures. Conclusions: Our results suggest that temperature does not significantly affect the per-flea efficiency of Y. pestis transmission by X. cheopis, but that temperature is likely to influence the dynamics of Y. pestis flea-borne transmission, perhaps by affecting persistence of the bacteria in the flea gut or by influencing flea survival. Whether Y. pestis biofilm production is important for transmission at different temperatures remains unresolved, although our results support the hypothesis that blockage is not necessary for efficient transmission.

    Chromosome 15q25 (CHRNA3-CHRNA5) Variation Impacts Indirectly on Lung Cancer Risk

    Genetic variants at the 15q25 CHRNA5-CHRNA3 locus have been shown to influence lung cancer risk; however, there is controversy as to whether the variants have a direct carcinogenic effect on lung cancer risk or act indirectly through smoking behavior. We have performed a detailed analysis of the 15q25 risk variants rs12914385 and rs8042374 with smoking behavior and lung cancer risk in 4,343 lung cancer cases and 1,479 controls from the Genetic Lung Cancer Predisposition Study (GELCAPS). Both rs12914385 and rs8042374 showed a strong association with lung cancer risk: odds ratios (OR) were 1.44 (95% confidence interval (CI): 1.29–1.62, P = 3.69×10^-10) and 1.35 (95% CI: 1.18–1.55, P = 9.99×10^-6), respectively. Each copy of the risk alleles at rs12914385 and rs8042374 was associated with increased cigarette consumption of 1.0 and 0.9 cigarettes per day (CPD), respectively (P = 5.18×10^-5 and P = 5.65×10^-3). These genetically determined modest differences in smoking behavior can be shown to be sufficient to account for the 15q25 association with lung cancer risk. To further verify the indirect effect of 15q25 on the risk, we restricted our analysis of lung cancer risk to never-smokers and conducted a meta-analysis of previously published studies of lung cancer risk in never-smokers. Never-smoker studies published in English were ascertained from PubMed using the search terms lung cancer, risk, genome-wide association, and candidate genes. Our study and five previously published studies provided data on 2,405 never-smoker lung cancer cases and 7,622 controls. In the pooled analysis, no association was found between the 15q25 variation and lung cancer risk (OR = 1.09, 95% CI: 0.94–1.28). This study affirms the 15q25 association with smoking and is consistent with an indirect link between genotype and lung cancer risk.

    Gene identification and protein classification in microbial metagenomic sequence data via incremental clustering

    Background: The identification and study of proteins from metagenomic datasets can shed light on the roles and interactions of the source organisms in their communities. However, metagenomic datasets are characterized by the presence of organisms with varying GC composition, codon usage biases, etc., and consequently gene identification is challenging. The vast amount of sequence data also requires faster protein family classification tools. Results: We present a computational improvement to a sequence clustering approach that we developed previously to identify and classify protein coding genes in large microbial metagenomic datasets. The clustering approach can be used to identify protein coding genes in prokaryotes, viruses, and intron-less eukaryotes. The computational improvement is based on an incremental clustering method that does not require the expensive all-against-all computation that was required by the original approach, while still preserving the remote homology detection capabilities. We present evaluations of the clustering approach in protein-coding gene identification and classification, and also present the results of updating the protein clusters from our previous work with recent genomic and metagenomic sequences. The clustering results are available via CAMERA (http://camera.calit2.net). Conclusion: The clustering paradigm is shown to be a very useful tool in the analysis of microbial metagenomic data. The incremental clustering method is shown to be much faster than the original approach in identifying genes, grouping sequences into existing protein families, and also identifying novel families that have multiple members in a metagenomic dataset. These clusters provide a basis for further studies of protein families.

    A Complex Cell Division Machinery Was Present in the Last Common Ancestor of Eukaryotes

    Background: The midbody is a transient complex structure containing proteins involved in cytokinesis. Up to now, it has been described only in Metazoa. Other eukaryotes present a variety of structures implicated in the last steps of cell division, such as the septum in fungi or the phragmoplast in plants. However, it is unclear whether these structures are homologous (derive from a common ancestral structure) or analogous (have distinct evolutionary origins). Recently, the proteome of the hamster midbody has been characterized and 160 proteins identified. Methodology/Principal Findings: Using phylogenomic approaches, we show here that nearly all of these 160 proteins (95%) are conserved across metazoan lineages. More surprisingly, we show that a large part of the mammalian midbody components (91 proteins) were already present in the last common ancestor of all eukaryotes (LECA) and were most likely involved in the construction of a complex multi-protein assemblage acting in cell division. Conclusions/Significance: Our results indicate that the midbodies of non-mammalian metazoans are likely very similar to the mammalian one and that the ancestor of Metazoa possessed a nearly modern midbody. Moreover, our analyses support the hypothesis that the midbody and the structures involved in cytokinesis in other eukaryotes derive from a large and complex structure present in LECA, likely involved in cytokinesis. This is an additional argument in favour of the idea of a comple

    SIP metagenomics identifies uncultivated Methylophilaceae as dimethylsulphide degrading bacteria in soil and lake sediment.

    Dimethylsulphide (DMS) has an important role in the global sulphur cycle and atmospheric chemistry. Microorganisms using DMS as sole carbon, sulphur or energy source contribute to the cycling of DMS in a wide variety of ecosystems. The diversity of microbial populations degrading DMS in terrestrial environments is poorly understood. Based on cultivation studies, a wide range of bacteria isolated from terrestrial ecosystems were shown to be able to degrade DMS, yet it remains unknown whether any of these have important roles in situ. In this study, we identified bacteria using DMS as a carbon and energy source in two terrestrial environments, an agricultural soil and a lake sediment, by DNA stable isotope probing (SIP). Microbial communities involved in DMS degradation were analysed by denaturing gradient gel electrophoresis, high-throughput sequencing of SIP gradient fractions and metagenomic sequencing of phi29-amplified community DNA. Labelling patterns of time course SIP experiments identified members of the Methylophilaceae family, not previously implicated in DMS degradation, as dominant DMS-degrading populations in soil and lake sediment. Thiobacillus spp. were also detected in 13C-DNA from SIP incubations. Metagenomic sequencing also suggested involvement of Methylophilaceae in DMS degradation and further indicated shifts in the functional profile of the DMS-assimilating communities in line with methylotrophy and oxidation of inorganic sulphur compounds. Overall, these data suggest that unlike in the marine environment, where gammaproteobacterial populations were identified by SIP as DMS degraders, betaproteobacterial Methylophilaceae may have a key role in DMS cycling in terrestrial environments. HS was supported by a UK Natural Environment Research Council Advanced Fellowship (NE/E013333/1), and ÖE by a postgraduate scholarship from the University of Warwick and an Early Career Fellowship from the Institute of Advanced Study, University of Warwick, UK. Lawrence Davies is acknowledged for help with QIIME.

    Genome wide prediction of protein function via a generic knowledge discovery approach based on evidence integration

    BACKGROUND: The automation of many common molecular biology techniques has resulted in the accumulation of vast quantities of experimental data. One of the major challenges now facing researchers is how to process this data to yield useful information about a biological system (e.g. knowledge of genes and their products, and the biological roles of proteins, their molecular functions, localizations and interaction networks). We present a technique called Global Mapping of Unknown Proteins (GMUP), which uses the Gene Ontology Index to relate diverse sources of experimental data by creating an abstraction layer of evidence data. This abstraction layer is used as input to a neural network which, once trained, can be used to predict function from the evidence data of unannotated proteins. The method allows us to add almost any experimental data set that is related to protein function and incorporates the Gene Ontology to our evidence data, in order to seek relationships between the different sets. RESULTS: We have demonstrated the capabilities of this method in two ways. We first collected various experimental datasets associated with yeast (Saccharomyces cerevisiae) and applied the technique to a set of previously annotated open reading frames (ORFs). These ORFs were divided into training and test sets and were used to examine the accuracy of the predictions made by our method. Then we applied GMUP to previously unannotated ORFs and made 1980, 836 and 1969 predictions corresponding to the GO Biological Process, Molecular Function and Cellular Component sub-categories, respectively. We found that GMUP was particularly successful at predicting ORFs with functions associated with the ribonucleoprotein complex, protein metabolism and transportation. CONCLUSION: This study presents a global and generic gene knowledge discovery approach based on evidence integration of various genome-scale data. It can be used to provide insight into how certain biological processes are implemented by the interaction and coordination of proteins, which may serve as a guide for future analysis. New data can be readily incorporated as it becomes available to provide more reliable predictions or further insights into processes and interactions.

    A probabilistic framework to predict protein function from interaction data integrated with semantic knowledge

    Background: The functional characterization of newly discovered proteins has been a challenge in the post-genomic era. Protein-protein interactions provide insights for functional analysis because the function of unknown proteins can be postulated on the basis of their interaction evidence with known proteins. The protein-protein interaction data sets have been enriched by high-throughput experimental methods. However, functional analysis using the interaction data is limited in accuracy because of experimentally generated false positives and interactions that lack functional linkage. Results: Protein-protein interaction data can be integrated with the functional knowledge existing in the Gene Ontology (GO) database. We apply similarity measures to assess the functional similarity between interacting proteins. We present a probabilistic framework for predicting functions of unknown proteins based on the functional similarity. We use leave-one-out cross validation to compare the performance. The experimental results demonstrate that our algorithm performs better than other competing methods in terms of prediction accuracy. In particular, it handles the high false positive rates of current interaction data well. Conclusion: Experimentally determined protein-protein interactions are too error-prone to uncover the functional associations among proteins on their own. The performance of function prediction for uncharacterized proteins can be enhanced by integrating the multiple data sources available.