
    P-value based visualization of codon usage data

    Two important and as yet unsolved problems in bacterial genome research are the identification of horizontally transferred genes and the prediction of gene expression levels. Both problems can be addressed by multivariate analysis of codon usage data. In particular, dimensionality reduction methods for the visualization of multivariate data have been shown to be effective tools for codon usage analysis. Here we propose a multidimensional scaling approach using a novel similarity measure for codon usage tables. Our probabilistic similarity measure is based on P-values derived from the well-known chi-square test for comparing two distributions. Experimental results on four microbial genomes indicate that the new method is well suited for the analysis of horizontal gene transfer and translational selection. Compared with the widely used correspondence analysis, our method did not suffer from outlier sensitivity and showed better clustering of putative alien genes in most cases.
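    The core idea can be sketched in a few lines: compute a chi-square P-value for every pair of codon usage tables, turn it into a dissimilarity, and embed the genes with multidimensional scaling. The sketch below is an assumption-laden illustration (toy codon counts, a 1 − P dissimilarity, scikit-learn's metric MDS), not the authors' implementation.

```python
# Minimal sketch (not the authors' implementation): a chi-square P-value
# dissimilarity between codon usage count tables, embedded with metric MDS.
# The toy data and the 1 - P dissimilarity choice are illustrative assumptions.
import numpy as np
from scipy.stats import chi2_contingency
from sklearn.manifold import MDS

def pvalue_dissimilarity(counts_a, counts_b):
    """Chi-square test of homogeneity between two codon count vectors.
    Returns 1 - P so that similar usage (high P) maps to a small distance."""
    table = np.vstack([counts_a, counts_b])
    table = table[:, table.sum(axis=0) > 0]   # drop codons absent from both genes
    _, p, _, _ = chi2_contingency(table)
    return 1.0 - p

# toy data: rows = genes, columns = 64 codon counts
rng = np.random.default_rng(0)
codon_counts = rng.integers(0, 30, size=(10, 64))

n = codon_counts.shape[0]
D = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        D[i, j] = D[j, i] = pvalue_dissimilarity(codon_counts[i], codon_counts[j])

# 2-D embedding of the precomputed dissimilarities for visualization
coords = MDS(n_components=2, dissimilarity="precomputed", random_state=0).fit_transform(D)
print(coords[:3])
```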

    Statistical learning of peptide retention behavior in chromatographic separations: a new kernel-based approach for computational proteomics

    Background: High-throughput peptide and protein identification technologies have benefited tremendously from strategies based on tandem mass spectrometry (MS/MS) in combination with database searching algorithms. A major problem with existing methods lies in the significant number of false positive and false negative annotations. So far, standard algorithms for protein identification do not use the information gained from separation processes usually involved in peptide analysis, such as retention time, which is readily available from chromatographic separation of the sample. Identification can thus be improved by comparing measured retention times to predicted retention times. Current prediction models are derived from a set of measured test analytes, but they usually require large amounts of training data. Results: We introduce a new kernel function which can be applied in combination with support vector machines to a wide range of computational proteomics problems. We show the performance of this new approach by applying it to the prediction of peptide adsorption/elution behavior in strong anion-exchange solid-phase extraction (SAX-SPE) and ion-pair reversed-phase high-performance liquid chromatography (IP-RP-HPLC). Furthermore, the predicted retention times are used to improve spectrum identifications by a p-value-based filtering approach. The approach was tested on a number of different datasets and shows excellent performance while requiring only very small training sets (about 40 peptides instead of thousands). Using the retention time predictor in our retention time filter significantly improves the fraction of correctly identified peptide mass spectra. Conclusion: The proposed kernel function is well suited for the prediction of chromatographic separation in computational proteomics and requires only a limited amount of training data. The performance of this new method is demonstrated by applying it to peptide retention time prediction in IP-RP-HPLC and prediction of peptide sample fractionation in SAX-SPE. Finally, we incorporate the predicted chromatographic behavior in a p-value-based filter to improve peptide identifications based on liquid chromatography-tandem mass spectrometry.
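    As an illustration of the general pattern rather than the paper's kernel, the following sketch plugs a precomputed peptide kernel into a support vector regressor for retention time prediction. The amino-acid-composition kernel, the peptides and the retention times are all made-up stand-ins.

```python
# Hedged sketch only: the paper's kernel is not reproduced here. This shows how
# a custom, precomputed peptide kernel can be combined with an SVM regressor
# for retention time prediction; the composition kernel is a simple stand-in
# and the training data are synthetic.
import numpy as np
from sklearn.svm import SVR

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition(peptide):
    """Length-normalized amino acid composition vector (20-dim)."""
    counts = np.array([peptide.count(a) for a in AMINO_ACIDS], dtype=float)
    return counts / max(len(peptide), 1)

def peptide_kernel(peps_a, peps_b):
    """Linear kernel on composition features; a placeholder for a sequence kernel."""
    A = np.array([composition(p) for p in peps_a])
    B = np.array([composition(p) for p in peps_b])
    return A @ B.T

train_peps = ["ACDK", "LLGYK", "STVNR", "WFDEK", "GGGAK", "PLIVR"]
train_rt = np.array([12.1, 25.3, 17.8, 30.2, 9.4, 27.5])   # minutes, synthetic

svr = SVR(kernel="precomputed", C=10.0)
svr.fit(peptide_kernel(train_peps, train_peps), train_rt)

test_peps = ["ACDR", "WFLEK"]
predicted = svr.predict(peptide_kernel(test_peps, train_peps))
print(predicted)

# Identifications whose measured retention time deviates strongly from the
# prediction could then be down-weighted or filtered, e.g. by a residual cutoff.
```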

    Gene prediction in metagenomic fragments: A large scale machine learning approach

    Background: Metagenomics is an approach to the characterization of microbial genomes via the direct isolation of genomic sequences from the environment without prior cultivation. The amount of metagenomic sequence data is growing fast, while computational methods for metagenome analysis are still in their infancy. In contrast to genomic sequences of single species, which can usually be assembled and analyzed by many available methods, a large proportion of metagenome data remains as unassembled anonymous sequencing reads. One of the aims of all metagenomic sequencing projects is the identification of novel genes. The short length of the fragments (Sanger sequencing, for example, yields fragments of about 700 bp on average) and the unknown phylogenetic origin of most fragments require approaches to gene prediction that differ from the currently available methods for genomes of single species. In particular, the large size of metagenomic samples requires fast and accurate methods with small numbers of false positive predictions. Results: We introduce a novel gene prediction algorithm for metagenomic fragments based on a two-stage machine learning approach. In the first stage, we use linear discriminants for monocodon usage, dicodon usage and translation initiation sites to extract features from DNA sequences. In the second stage, an artificial neural network combines these features with open reading frame length and fragment GC-content to compute the probability that an open reading frame encodes a protein. This probability is used for the classification and scoring of gene candidates. With large-scale training, our method provides fast single-fragment predictions with good sensitivity and specificity on artificially fragmented genomic DNA. Additionally, the method is able to predict translation initiation sites accurately and distinguishes complete from incomplete genes with high reliability. Conclusion: Large-scale machine learning methods are well suited for gene prediction in metagenomic DNA fragments. In particular, the combination of linear discriminants and neural networks is promising and should be considered for integration into metagenomic analysis pipelines. The data sets can be downloaded from the URL provided (see Availability and requirements section).
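    A minimal two-stage sketch of the described idea, on synthetic data and with illustrative feature dimensions and labels: linear discriminants condense codon-usage features into scores, and a small neural network combines those scores with ORF length and GC content into a coding probability. This is not the authors' trained model.

```python
# Two-stage sketch on synthetic data (illustrative assumptions throughout):
# stage 1 compresses codon-usage features with linear discriminants,
# stage 2 combines the discriminant scores with ORF length and GC content
# in a small neural network that outputs a coding probability.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(1)
n = 400
monocodon = rng.random((n, 64))     # stand-in monocodon usage frequencies
dicodon   = rng.random((n, 4096))   # stand-in dicodon usage frequencies
orf_len   = rng.integers(90, 1500, n).reshape(-1, 1).astype(float)
gc        = rng.uniform(0.3, 0.7, n).reshape(-1, 1)
is_coding = rng.integers(0, 2, n)   # synthetic labels: 1 = coding ORF

# Stage 1: one linear discriminant score per feature group
lda_mono = LinearDiscriminantAnalysis().fit(monocodon, is_coding)
lda_di   = LinearDiscriminantAnalysis().fit(dicodon, is_coding)
score_mono = lda_mono.decision_function(monocodon).reshape(-1, 1)
score_di   = lda_di.decision_function(dicodon).reshape(-1, 1)

# Stage 2: neural network combines discriminant scores with length and GC content
features = np.hstack([score_mono, score_di, orf_len, gc])
net = MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=1)
net.fit(features, is_coding)

# Probability that a candidate ORF encodes a protein, used to score candidates
print(net.predict_proba(features[:3])[:, 1])
```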

    Factors That Affect Large Subunit Ribosomal DNA Amplicon Sequencing Studies of Fungal Communities: Classification Method, Primer Choice, and Error

    Nuclear large subunit ribosomal DNA is widely used in fungal phylogenetics and, to an increasing extent, also in amplicon-based environmental sequencing. The relatively short reads produced by next-generation sequencing, however, make primer choice and sequence error important variables for obtaining accurate taxonomic classifications. In this simulation study we tested the performance of three classification methods: 1) a similarity-based method (BLAST + Metagenomic Analyzer, MEGAN); 2) a composition-based method (Ribosomal Database Project naïve Bayesian classifier, NBC); and 3) a phylogeny-based method (Statistical Assignment Package, SAP). We also tested the effects of sequence length, primer choice, and sequence error on classification accuracy and perceived community composition. Using a leave-one-out cross-validation approach, results for classifications to the genus rank were as follows: BLAST + MEGAN had the lowest error rate and was particularly robust to sequence error; SAP accuracy was highest when long LSU query sequences were classified; and NBC ran significantly faster than the other tested methods. All methods performed poorly with the shortest 50–100 bp sequences. Increasing simulated sequence error reduced classification accuracy. Community shifts were detected due to sequence error and primer selection even though there was no change in the underlying community composition. Short-read datasets from individual primers, as well as pooled datasets, appear to only approximate the true community composition. We hope this work informs investigators of some of the factors that affect the quality and interpretation of their environmental gene surveys.
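    The leave-one-out protocol used to score genus-level accuracy can be illustrated as follows; a simple k-mer nearest-neighbour classifier stands in for BLAST + MEGAN, NBC and SAP, and the reference profiles and genus labels are synthetic.

```python
# Illustrative sketch of the leave-one-out evaluation protocol only: each
# reference sequence is withheld, classified against the remaining references,
# and the predicted genus is compared to the known label. The nearest-neighbour
# classifier and the data are stand-ins, not the tools tested in the study.
import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(2)
n_refs, n_kmers = 60, 256
kmer_profiles = rng.random((n_refs, n_kmers))   # stand-in k-mer frequency profiles
genus_labels = rng.integers(0, 6, n_refs)       # six hypothetical genera

correct = 0
for train_idx, test_idx in LeaveOneOut().split(kmer_profiles):
    clf = KNeighborsClassifier(n_neighbors=1)
    clf.fit(kmer_profiles[train_idx], genus_labels[train_idx])
    correct += int(clf.predict(kmer_profiles[test_idx])[0] == genus_labels[test_idx][0])

print(f"genus-level LOO accuracy: {correct / n_refs:.2f}")
```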

    Critical Assessment of Metagenome Interpretation: A benchmark of metagenomics software

    In metagenome analysis, computational methods for assembly, taxonomic profiling and binning are key components facilitating downstream biological data interpretation. However, a lack of consensus about benchmarking datasets and evaluation metrics complicates proper performance assessment. The Critical Assessment of Metagenome Interpretation (CAMI) challenge has engaged the global developer community to benchmark their programs on datasets of unprecedented complexity and realism. Benchmark metagenomes were generated from ~700 newly sequenced microorganisms and ~600 novel viruses and plasmids, including genomes with varying degrees of relatedness to each other and to publicly available ones, and representing common experimental setups. Across all datasets, assembly and genome binning programs performed well for species represented by individual genomes, while performance was substantially affected by the presence of related strains. Taxonomic profiling and binning programs were proficient at high taxonomic ranks, with a notable performance decrease below the family level. Parameter settings substantially impacted performance, underscoring the importance of program reproducibility. While highlighting current challenges in computational metagenomics, the CAMI results provide a roadmap for software selection to answer specific research questions.
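    As a hypothetical illustration of the kind of per-bin metric such benchmarks rely on (not CAMI's own evaluation code), the snippet below computes purity and completeness of a genome binning against a simulated ground-truth assignment of contigs to genomes.

```python
# Not CAMI's evaluation code: a small illustration of per-bin purity and
# completeness, computed against a known (simulated) contig-to-genome mapping.
from collections import Counter, defaultdict

# hypothetical ground truth and predicted bins: contig id -> genome / bin id
truth = {"c1": "gA", "c2": "gA", "c3": "gB", "c4": "gB", "c5": "gB", "c6": "gC"}
bins  = {"c1": "b1", "c2": "b1", "c3": "b1", "c4": "b2", "c5": "b2", "c6": "b2"}

contigs_per_bin = defaultdict(list)
for contig, b in bins.items():
    contigs_per_bin[b].append(truth[contig])

genome_sizes = Counter(truth.values())
for b, genomes in contigs_per_bin.items():
    dominant, hits = Counter(genomes).most_common(1)[0]
    purity = hits / len(genomes)                    # fraction of bin from dominant genome
    completeness = hits / genome_sizes[dominant]    # fraction of that genome recovered
    print(f"{b}: dominant={dominant} purity={purity:.2f} completeness={completeness:.2f}")
```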