1,567 research outputs found

    Discriminative Measures for Comparison of Phylogenetic Trees

    In this paper we introduce and study three new measures for efficient discriminative comparison of phylogenetic trees. The NNI navigation dissimilarity $d_{nav}$ counts the steps along a “combing” of the Nearest Neighbor Interchange (NNI) graph of binary hierarchies, providing an efficient approximation to the (NP-hard) NNI distance in terms of “edit length”. At the same time, a closed-form formula for $d_{nav}$ presents it as a weighted count of pairwise incompatibilities between clusters, lending it the character of an edge dissimilarity measure as well. A relaxation of this formula to a simple count yields another measure on all trees, the crossing dissimilarity $d_{CM}$. Both dissimilarities are symmetric and positive definite (vanish only between identical trees) on binary hierarchies, but they fail to satisfy the triangle inequality. Nevertheless, both are bounded below by the widely used Robinson–Foulds metric and bounded above by a closely related true metric, the cluster-cardinality metric $d_{CC}$. We show that each of the three proposed new dissimilarities is computable in time $O(n^2)$ in the number of leaves $n$, and conclude the paper with a brief numerical exploration of the distribution over tree space of these dissimilarities in comparison with the Robinson–Foulds metric and the more recently introduced matching-split distance. For more information: Kod*La
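To make the cluster-based view in this abstract concrete, here is a minimal sketch of two cluster comparisons on rooted trees. It assumes trees are given as nested Python tuples with string leaf labels (a representation chosen here for illustration, not taken from the paper). `robinson_foulds` is the standard unnormalized symmetric-difference count. `crossing_count` is only one plausible reading of "a simple count of pairwise incompatibilities between clusters"; the abstract does not give the formula, so it should not be assumed to reproduce $d_{CM}$ or its normalization.

```python
def clusters(tree):
    """Return the set of clusters (leaf sets below each node) of a nested-tuple tree."""
    found = set()

    def walk(node):
        if not isinstance(node, tuple):
            leaves = frozenset([node])          # a leaf is its own cluster
        else:
            leaves = frozenset().union(*(walk(child) for child in node))
        found.add(leaves)
        return leaves

    walk(tree)
    return found


def robinson_foulds(tree_a, tree_b):
    """Unnormalized Robinson-Foulds distance: clusters present in exactly one tree."""
    return len(clusters(tree_a) ^ clusters(tree_b))


def crossing_count(tree_a, tree_b):
    """Count cluster pairs (A from tree_a, B from tree_b) that cross.

    Two clusters cross when they overlap but neither contains the other.
    Hypothetical illustration; not confirmed to match the paper's d_CM.
    """
    count = 0
    for a in clusters(tree_a):
        for b in clusters(tree_b):
            if a & b and not a <= b and not b <= a:
                count += 1
    return count


if __name__ == "__main__":
    t1 = (("a", "b"), ("c", "d"))
    t2 = (("a", "c"), ("b", "d"))
    print(robinson_foulds(t1, t2), crossing_count(t1, t2))
```

The nested loop over cluster pairs is quadratic in the number of clusters, which is in line with the $O(n^2)$ bound quoted above only because a rooted tree on $n$ leaves has $O(n)$ clusters; the per-pair set operations here add further cost that a careful implementation would avoid.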

    Cophenetic metrics for phylogenetic trees, after Sokal and Rohlf

    Phylogenetic tree comparison metrics are an important tool in the study of evolution, and hence the definition of such metrics is an interesting problem in phylogenetics. In a paper in Taxon fifty years ago, Sokal and Rohlf proposed to measure quantitatively the difference between a pair of phylogenetic trees by first encoding them by means of their half-matrices of cophenetic values, and then comparing these matrices. This idea has been used several times since then to define dissimilarity measures between phylogenetic trees but, to our knowledge, no proper metric on weighted phylogenetic trees with nested taxa based on this idea has been formally defined and studied yet. Actually, the cophenetic values of pairs of different taxa alone are not enough to single out phylogenetic trees with weighted arcs or nested taxa. In this paper we define a family of cophenetic metrics that compare phylogenetic trees on the same set of taxa by encoding them by means of their vectors of cophenetic values of pairs of taxa and depths of single taxa, and then computing the $L^p$ norm of the difference of the corresponding vectors. Then, we study, either analytically or numerically, some of their basic properties: neighbors, diameter, distribution, and their rank correlation with each other and with other metrics. Comment: The "authors' cut" of a paper published in BMC Bioinformatics 14:3 (2013). 46 pages
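The encoding described in this abstract can be sketched in a few lines. The example below is a simplified illustration under stated assumptions: trees are nested Python tuples with string leaf labels, every arc has weight 1, and taxa appear only at leaves (so weighted arcs and nested taxa, which the paper does cover, are not handled). The cophenetic value of a pair of taxa is taken as the depth of their lowest common ancestor, and the entry for a single taxon is its own depth. Function names are hypothetical.

```python
from itertools import combinations_with_replacement


def leaf_paths(tree, path=()):
    """Map each leaf label to the tuple of child indices leading to it from the root."""
    if not isinstance(tree, tuple):
        return {tree: path}
    paths = {}
    for index, child in enumerate(tree):
        paths.update(leaf_paths(child, path + (index,)))
    return paths


def cophenetic_vector(tree):
    """Return {(a, b): depth of the lowest common ancestor of a and b}.

    For a == b the entry is the depth of the leaf itself.
    """
    paths = leaf_paths(tree)
    vector = {}
    for a, b in combinations_with_replacement(sorted(paths), 2):
        shared = 0
        for step_a, step_b in zip(paths[a], paths[b]):
            if step_a != step_b:
                break
            shared += 1                  # length of the common root-path prefix
        vector[(a, b)] = shared
    return vector


def cophenetic_distance(tree_a, tree_b, p=2):
    """L^p norm of the difference of the two cophenetic vectors (p >= 1)."""
    vec_a, vec_b = cophenetic_vector(tree_a), cophenetic_vector(tree_b)
    if vec_a.keys() != vec_b.keys():
        raise ValueError("trees must be on the same set of taxa")
    total = sum(abs(vec_a[key] - vec_b[key]) ** p for key in vec_a)
    return total ** (1.0 / p)


if __name__ == "__main__":
    t1 = (("a", "b"), "c")
    t2 = (("a", "c"), "b")
    print(cophenetic_distance(t1, t2, p=1), cophenetic_distance(t1, t2, p=2))
```

Including the single-taxon depths in the vector is what the abstract points to when it notes that pairwise values alone do not single out a tree; in this unweighted leaf-only sketch they simply add one entry per taxon.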

    Microbial Similarity between Students in a Common Dormitory Environment Reveals the Forensic Potential of Individual Microbial Signatures.

    The microbiota of the built environment is an amalgamation of both human and environmental sources. While human sources have been examined within single-family households or in public environments, it is unclear what effect a large number of cohabitating people have on the microbial communities of their shared environment. We sampled the public and private spaces of a college dormitory, disentangling individual microbial signatures and their impact on the microbiota of common spaces. We compared multiple methods for marker gene sequence clustering and found that minimum entropy decomposition (MED) was best able to distinguish between the microbial signatures of different individuals and was able to uncover more discriminative taxa across all taxonomic groups. Further, weighted UniFrac- and random forest-based graph analyses uncovered two distinct spheres of hand- or shoe-associated samples. Using graph-based clustering, we identified spheres of interaction and found that connection between these clusters was enriched for hands, implicating them as a primary means of transmission. In contrast, shoe-associated samples were found to be freely interacting, with individual shoes more connected to each other than to the floors they interact with. Individual interactions were highly dynamic, with groups of samples originating from individuals clustering freely with samples from other individuals, while all floor and shoe samples consistently clustered together. IMPORTANCE: Humans leave behind a microbial trail, regardless of intention. This may allow for the identification of individuals based on the "microbial signatures" they shed in built environments. In a shared living environment, these trails intersect, and through interaction with common surfaces may become homogenized, potentially confounding our ability to link individuals to their associated microbiota. We sought to understand the factors that influence the mixing of individual signatures and how to process sequencing data to best tease apart these signatures.

    RasBhari: optimizing spaced seeds for database searching, read mapping and alignment-free sequence comparison

    Many algorithms for sequence analysis rely on word matching or word statistics. Often, these approaches can be improved if binary patterns representing match and don't-care positions are used as a filter, such that only those positions of words are considered that correspond to the match positions of the patterns. The performance of these approaches, however, depends on the underlying patterns. Herein, we show that the overlap complexity of a pattern set that was introduced by Ilie and Ilie is closely related to the variance of the number of matches between two evolutionarily related sequences with respect to this pattern set. We propose a modified hill-climbing algorithm to optimize pattern sets for database searching, read mapping and alignment-free sequence comparison of nucleic-acid sequences; our implementation of this algorithm is called rasbhari. Depending on the application at hand, rasbhari can either minimize the overlap complexity of pattern sets, maximize their sensitivity in database searching or minimize the variance of the number of pattern-based matches in alignment-free sequence comparison. We show that, for database searching, rasbhari generates pattern sets with slightly higher sensitivity than existing approaches. In our Spaced Words approach to alignment-free sequence comparison, pattern sets calculated with rasbhari led to more accurate estimates of phylogenetic distances than the randomly generated pattern sets that we previously used. Finally, we used rasbhari to generate patterns for short read classification with CLARK-S. Here too, the sensitivity of the results could be improved, compared to the default patterns of the program. We integrated rasbhari into Spaced Words; the source code of rasbhari is freely available at http://rasbhari.gobics.de
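The idea of a binary pattern with match and don't-care positions can be illustrated with a short sketch. It assumes patterns are written as strings of `1` (match) and `0` (don't care), which is a common convention but an assumption here, and the function names are hypothetical. This shows only how a pattern filters word positions; it is not the hill-climbing optimization that rasbhari performs.

```python
from collections import Counter


def spaced_words(seq, pattern):
    """Extract the spaced word at every window of seq.

    Only characters at the pattern's '1' positions are kept; '0' positions
    are don't-care positions and are skipped.
    """
    match_positions = [i for i, flag in enumerate(pattern) if flag == "1"]
    span = len(pattern)
    return [
        "".join(seq[start + i] for i in match_positions)
        for start in range(len(seq) - span + 1)
    ]


def count_spaced_matches(seq_a, seq_b, pattern):
    """Count pairs of windows (one per sequence) that agree at all match positions."""
    words_a = Counter(spaced_words(seq_a, pattern))
    words_b = Counter(spaced_words(seq_b, pattern))
    return sum(count * words_b[word] for word, count in words_a.items())


if __name__ == "__main__":
    print(spaced_words("ACGTAC", "101"))
    print(count_spaced_matches("ACGTAC", "ATGTTC", "101"))
```

With the pattern `101`, the windows `ACG` and `ATG` count as a match even though they differ in the middle position, which is the tolerance to substitutions that makes such patterns useful as filters.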

    Detailed evaluation of data analysis tools for subtyping of bacterial isolates based on whole genome sequencing : Neisseria meningitidis as a proof of concept

    Whole genome sequencing is increasingly recognized as the most informative approach for characterization of bacterial isolates. Success of the routine use of this technology in public health laboratories depends on the availability of well-characterized and verified data analysis methods. However, multiple subtyping workflows are now often being used for a single organism, and differences between them are not always well described. Moreover, methodologies for comparison of subtyping workflows and assessment of their performance are only beginning to emerge. The current work focuses on the detailed comparison of WGS-based subtyping workflows and evaluation of their suitability for the organism and the research context in question. We evaluated the performance of pipelines used for subtyping of Neisseria meningitidis, including the currently widely applied cgMLST approach and different SNP-based methods. In addition, the impact of using different tools for detection and filtering of recombinant regions and of using different reference genomes was tested. Our benchmarking analysis included both assessment of technical performance of the pipelines and functional comparison of the generated genetic distance matrices and phylogenetic trees. It was carried out using replicate sequencing datasets of high and low coverage, consisting mainly of isolates belonging to the clonal complex 269. We demonstrated that cgMLST and some of the SNP-based subtyping workflows showed very good performance characteristics and highly similar genetic distance matrices and phylogenetic trees with isolates belonging to the same clonal complex. However, only two of the tested workflows demonstrated reproducible results for a group of more closely related isolates. Additionally, results of the SNP-based subtyping workflows were to some extent dependent on the reference genome used. Interestingly, the use of recombination-filtering software generally reduced the similarity between the gene-by-gene and SNP-based methodologies for subtyping of N. meningitidis. Our study, where N. meningitidis was taken as an example, clearly highlights the need for more benchmarking comparative studies to eventually contribute to a justified use of a specific WGS data analysis workflow within an international public health laboratory context.

    Bayesian machine learning methods for predicting protein-peptide interactions and detecting mosaic structures in DNA sequences alignments

    Short well-defined domains known as peptide recognition modules (PRMs) regulate many important protein-protein interactions involved in the formation of macromolecular complexes and biochemical pathways. High-throughput experiments like yeast two-hybrid and phage display are expensive and intrinsically noisy; it would therefore be desirable to target informative interactions and pursue in silico approaches. We propose a probabilistic discriminative approach for predicting PRM-mediated protein-protein interactions from sequence data. The model suffered from over-fitting, so Laplacian regularisation was found to be important in achieving a reasonable generalisation performance. A hybrid approach yielded the best performance, where the binding site motifs were initialised with the predictions of a generative model. We also propose another discriminative model which can be applied to all sequences present in the organism at a significantly lower computational cost. This is due to its additional assumption that the underlying binding sites tend to be similar. It is difficult to distinguish between the binding site motifs of the PRM due to the small number of instances of each binding site motif. However, closely related species are expected to share similar binding sites, which would be expected to be highly conserved. We investigated rate variation along DNA sequence alignments, modelling confounding effects such as recombination. Traditional approaches to phylogenetic inference assume that a single phylogenetic tree can represent the relationships and divergences between the taxa. However, taxa sequences exhibit varying levels of conservation, e.g. due to regulatory elements and active binding sites, and certain bacteria and viruses undergo interspecific recombination. We propose a phylogenetic factorial hidden Markov model to infer recombination and rate variation. We examined the performance of our model and inference scheme on various synthetic alignments, and compared it to state-of-the-art breakpoint models. We investigated three DNA sequence alignments: one of maize actin genes, one bacterial (Neisseria), and one of HIV-1. Inference is carried out in the Bayesian framework, using Reversible Jump Markov Chain Monte Carlo.