Discriminative Measures for Comparison of Phylogenetic Trees
In this paper we introduce and study three new measures for efficient discriminative comparison of phylogenetic trees. The NNI navigation dissimilarity counts the steps along a "combing" of the Nearest Neighbor Interchange (NNI) graph of binary hierarchies, providing an efficient approximation to the (NP-hard) NNI distance in terms of "edit length". At the same time, a closed-form formula presents it as a weighted count of pairwise incompatibilities between clusters, lending it the character of an edge dissimilarity measure as well. A relaxation of this formula to a simple count yields another measure on all trees, the crossing dissimilarity. Both dissimilarities are symmetric and positive definite (vanishing only between identical trees) on binary hierarchies, but they fail to satisfy the triangle inequality. Nevertheless, both are bounded below by the widely used Robinson–Foulds metric and bounded above by a closely related true metric, the cluster-cardinality metric. We show that each of the three proposed new dissimilarities is computable in time O() in the number of leaves, and conclude the paper with a brief numerical exploration of the distribution over tree space of these dissimilarities in comparison with the Robinson–Foulds metric and the more recently introduced matching-split distance.
For more information: Kod*La
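The cluster-based view in the abstract above can be illustrated with a minimal sketch. It assumes rooted trees written as nested Python tuples with string leaves, takes the Robinson–Foulds value as the unnormalized size of the symmetric difference of the two cluster sets (some conventions halve it), and reads "crossing" as a pair of clusters that overlap without one containing the other. The function names are hypothetical, and the plain count below is only an illustration of counting pairwise cluster incompatibilities, not the paper's exact definitions or its efficient algorithm.

```python
def clusters(tree):
    """Return the set of clusters (leaf sets of all subtrees) of a nested-tuple tree."""
    found = set()

    def walk(node):
        if isinstance(node, tuple):
            # Internal node: its cluster is the union of its children's clusters.
            leaves = frozenset().union(*(walk(child) for child in node))
        else:
            # Leaf: a singleton cluster.
            leaves = frozenset([node])
        found.add(leaves)
        return leaves

    walk(tree)
    return found


def robinson_foulds(t1, t2):
    """Number of clusters present in exactly one of the two trees (unnormalized)."""
    return len(clusters(t1) ^ clusters(t2))


def crossing_count(t1, t2):
    """Count cluster pairs (one per tree) that overlap but are not nested."""
    return sum(
        1
        for a in clusters(t1)
        for b in clusters(t2)
        if a & b and not (a <= b or b <= a)
    )


t1 = (("a", "b"), ("c", "d"))
t2 = (("a", "c"), ("b", "d"))
print(robinson_foulds(t1, t2))  # 4
print(crossing_count(t1, t2))   # 4
```

The double loop over clusters is quadratic in the number of clusters and is meant for reading, not for large trees.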
Cophenetic metrics for phylogenetic trees, after Sokal and Rohlf
Phylogenetic tree comparison metrics are an important tool in the study of
evolution, and hence the definition of such metrics is an interesting problem
in phylogenetics. In a paper in Taxon fifty years ago, Sokal and Rohlf proposed
to measure quantitatively the difference between a pair of phylogenetic trees
by first encoding them by means of their half-matrices of cophenetic values,
and then comparing these matrices. This idea has been used several times since
then to define dissimilarity measures between phylogenetic trees but, to our
knowledge, no proper metric on weighted phylogenetic trees with nested taxa
based on this idea has been formally defined and studied yet. Actually, the
cophenetic values of pairs of different taxa alone are not enough to single out
phylogenetic trees with weighted arcs or nested taxa. In this paper we define a
family of cophenetic metrics that compare phylogenetic trees on the same set of
taxa by encoding them by means of their vectors of cophenetic values of pairs
of taxa and depths of single taxa, and then computing the norm of the
difference of the corresponding vectors. Then, we study, either analytically or
numerically, some of their basic properties: neighbors, diameter, distribution,
and their rank correlation with each other and with other metrics.
Comment: The "authors' cut" of a paper published in BMC Bioinformatics 14:3
(2013). 46 page
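The construction described in the abstract above (encode each tree as a vector of cophenetic values of pairs of taxa plus depths of single taxa, then take a norm of the difference) can be sketched as follows. This is a simplified illustration under stated assumptions: trees are unweighted nested tuples with string leaves, depth is counted in edges from the root, the cophenetic value of a pair is the depth of its lowest common ancestor, and nested taxa (labels on internal nodes) are not handled. The function names are hypothetical.

```python
from itertools import combinations


def cophenetic_vector(tree):
    """Map each taxon pair to the depth of its LCA and each (x, x) to the depth of x."""
    values = {}

    def walk(node, depth):
        if not isinstance(node, tuple):
            values[(node, node)] = depth  # depth of a single taxon
            return [node]
        groups = [walk(child, depth + 1) for child in node]
        # Leaves from different children have this node as lowest common ancestor.
        for g1, g2 in combinations(groups, 2):
            for x in g1:
                for y in g2:
                    values[tuple(sorted((x, y)))] = depth
        return [leaf for group in groups for leaf in group]

    walk(tree, 0)
    return values


def cophenetic_distance(t1, t2, p=2):
    """L^p norm of the difference of the two cophenetic vectors."""
    v1, v2 = cophenetic_vector(t1), cophenetic_vector(t2)
    if v1.keys() != v2.keys():
        raise ValueError("trees must be on the same set of taxa")
    return sum(abs(v1[k] - v2[k]) ** p for k in v1) ** (1 / p)


t1 = (("a", "b"), "c")
t2 = ("a", ("b", "c"))
print(cophenetic_distance(t1, t2, p=1))  # 4.0
print(cophenetic_distance(t1, t2, p=2))  # 2.0
```

Including the single-taxon depths matters here: the pairwise entries alone would not register that "a" and "c" sit at different depths in the two trees.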
Microbial Similarity between Students in a Common Dormitory Environment Reveals the Forensic Potential of Individual Microbial Signatures.
The microbiota of the built environment is an amalgamation of both human and environmental sources. While human sources have been examined within single-family households or in public environments, it is unclear what effect a large number of cohabitating people have on the microbial communities of their shared environment. We sampled the public and private spaces of a college dormitory, disentangling individual microbial signatures and their impact on the microbiota of common spaces. We compared multiple methods for marker gene sequence clustering and found that minimum entropy decomposition (MED) was best able to distinguish between the microbial signatures of different individuals and was able to uncover more discriminative taxa across all taxonomic groups. Further, weighted UniFrac- and random forest-based graph analyses uncovered two distinct spheres of hand- or shoe-associated samples. Using graph-based clustering, we identified spheres of interaction and found that connection between these clusters was enriched for hands, implicating them as a primary means of transmission. In contrast, shoe-associated samples were found to be freely interacting, with individual shoes more connected to each other than to the floors they interact with. Individual interactions were highly dynamic, with groups of samples originating from individuals clustering freely with samples from other individuals, while all floor and shoe samples consistently clustered together.
IMPORTANCE: Humans leave behind a microbial trail, regardless of intention. This may allow for the identification of individuals based on the "microbial signatures" they shed in built environments. In a shared living environment, these trails intersect, and through interaction with common surfaces may become homogenized, potentially confounding our ability to link individuals to their associated microbiota.
We sought to understand the factors that influence the mixing of individual signatures and how best to process sequencing data to tease apart these signatures.
RasBhari: optimizing spaced seeds for database searching, read mapping and alignment-free sequence comparison
Many algorithms for sequence analysis rely on word matching or word
statistics. Often, these approaches can be improved if binary patterns
representing match and don't-care positions are used as a filter, such that
only those positions of words are considered that correspond to the match
positions of the patterns. The performance of these approaches, however,
depends on the underlying patterns. Herein, we show that the overlap complexity
of a pattern set that was introduced by Ilie and Ilie is closely related to the
variance of the number of matches between two evolutionarily related sequences
with respect to this pattern set. We propose a modified hill-climbing algorithm
to optimize pattern sets for database searching, read mapping and
alignment-free sequence comparison of nucleic-acid sequences; our
implementation of this algorithm is called rasbhari. Depending on the
application at hand, rasbhari can either minimize the overlap complexity of
pattern sets, maximize their sensitivity in database searching or minimize the
variance of the number of pattern-based matches in alignment-free sequence
comparison. We show that, for database searching, rasbhari generates pattern
sets with slightly higher sensitivity than existing approaches. In our Spaced
Words approach to alignment-free sequence comparison, pattern sets calculated
with rasbhari led to more accurate estimates of phylogenetic distances than the
randomly generated pattern sets that we previously used. Finally, we used
rasbhari to generate patterns for short read classification with CLARK-S. Here
too, the sensitivity of the results could be improved, compared to the default
patterns of the program. We integrated rasbhari into Spaced Words; the source
code of rasbhari is freely available at http://rasbhari.gobics.de
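The filtering idea in the abstract above, where a binary pattern of match and don't-care positions selects which positions of a word are compared, can be sketched as follows. This is an illustrative example only, not rasbhari's code or its optimization procedure: the pattern is assumed to be a string of "1" (match) and "0" (don't-care) characters, and the function names are hypothetical.

```python
from collections import Counter


def spaced_words(seq, pattern):
    """Extract, at every start position, the characters at the pattern's match positions."""
    match_positions = [i for i, flag in enumerate(pattern) if flag == "1"]
    return [
        "".join(seq[start + i] for i in match_positions)
        for start in range(len(seq) - len(pattern) + 1)
    ]


def count_spaced_matches(seq1, seq2, pattern):
    """Count position pairs (one per sequence) whose spaced words are identical."""
    words1 = Counter(spaced_words(seq1, pattern))
    words2 = Counter(spaced_words(seq2, pattern))
    return sum(n * words2[word] for word, n in words1.items())


s1 = "ACGTAC"
s2 = "ATGTCC"
print(spaced_words(s1, "101"))              # ['AG', 'CT', 'GA', 'TC']
print(count_spaced_matches(s1, s2, "101"))  # 2
print(count_spaced_matches(s1, s2, "111"))  # 0
```

In this toy pair, the mismatches fall on don't-care positions often enough that the spaced pattern finds two matches where the contiguous pattern of the same length finds none.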
Detailed evaluation of data analysis tools for subtyping of bacterial isolates based on whole genome sequencing : Neisseria meningitidis as a proof of concept
Whole genome sequencing is increasingly recognized as the most informative approach for characterization of bacterial isolates. Success of the routine use of this technology in public health laboratories depends on the availability of well-characterized and verified data analysis methods. However, multiple subtyping workflows are now often being used for a single organism, and differences between them are not always well described. Moreover, methodologies for comparison of subtyping workflows and assessment of their performance are only beginning to emerge. The current work focuses on the detailed comparison of WGS-based subtyping workflows and evaluation of their suitability for the organism and the research context in question. We evaluated the performance of pipelines used for subtyping of Neisseria meningitidis, including the currently widely applied cgMLST approach and different SNP-based methods. In addition, we tested the impact of using different tools for detection and filtering of recombinant regions and of using different reference genomes. Our benchmarking analysis included both assessment of technical performance of the pipelines and functional comparison of the generated genetic distance matrices and phylogenetic trees. It was carried out using replicate sequencing datasets of high and low coverage, consisting mainly of isolates belonging to the clonal complex 269. We demonstrated that cgMLST and some of the SNP-based subtyping workflows showed very good performance characteristics and highly similar genetic distance matrices and phylogenetic trees with isolates belonging to the same clonal complex. However, only two of the tested workflows demonstrated reproducible results for a group of more closely related isolates. Additionally, results of the SNP-based subtyping workflows were to some extent dependent on the reference genome used.
Interestingly, the use of recombination-filtering software generally reduced the similarity between the gene-by-gene and SNP-based methodologies for subtyping of N. meningitidis. Our study, where N. meningitidis was taken as an example, clearly highlights the need for more comparative benchmarking studies to eventually contribute to a justified use of a specific WGS data analysis workflow within an international public health laboratory context.
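As a rough illustration of the kind of genetic distance matrix a gene-by-gene (cgMLST-style) workflow produces, the sketch below counts loci at which two allele profiles differ. It is not one of the workflows evaluated in the study: the profile layout (a list of allele numbers per isolate, with None for an uncalled locus), the choice to skip loci missing in either isolate, and all names are assumptions made for the example.

```python
def allelic_distance(profile1, profile2):
    """Number of loci with different allele calls, skipping loci missing in either profile."""
    if len(profile1) != len(profile2):
        raise ValueError("profiles must cover the same loci")
    return sum(
        1
        for a, b in zip(profile1, profile2)
        if a is not None and b is not None and a != b
    )


def distance_matrix(profiles):
    """Pairwise allelic distances for a dict mapping isolate name to allele profile."""
    names = sorted(profiles)
    return {
        (x, y): allelic_distance(profiles[x], profiles[y])
        for x in names
        for y in names
    }


# Hypothetical allele profiles over four loci.
profiles = {
    "iso1": [1, 4, 2, 7],
    "iso2": [1, 5, 2, None],
    "iso3": [3, 5, 2, 7],
}
matrix = distance_matrix(profiles)
print(matrix[("iso1", "iso2")])  # 1
print(matrix[("iso1", "iso3")])  # 2
print(matrix[("iso2", "iso3")])  # 1
```

How missing loci are treated is one of the choices that can make distance matrices from different workflows disagree, so it is worth stating explicitly in any comparison.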
Bayesian machine learning methods for predicting protein-peptide interactions and detecting mosaic structures in DNA sequences alignments
Short well-defined domains known as peptide recognition modules (PRMs) regulate many important protein-protein interactions involved in the formation of macromolecular complexes
and biochemical pathways. High-throughput experiments like yeast two-hybrid and phage
display are expensive and intrinsically noisy; it would therefore be desirable to target informative interactions and pursue in silico approaches. We propose a probabilistic discriminative
approach for predicting PRM-mediated protein-protein interactions from sequence data. The
model suffered from over-fitting, so Laplacian regularisation was found to be important in
achieving a reasonable generalisation performance. A hybrid approach yielded the best performance, where the binding site motifs were initialised with the predictions of a generative
model. We also propose another discriminative model which can be applied to all sequences
present in the organism at a significantly lower computational cost. This is due to its additional
assumption that the underlying binding sites tend to be similar. It is difficult to distinguish between the binding site motifs of the PRM due to the small
number of instances of each binding site motif. However, closely related species are expected
to share similar binding sites, which would be expected to be highly conserved. We investigated
rate variation along DNA sequence alignments, modelling confounding effects such as recombination. Traditional approaches to phylogenetic inference assume that a single phylogenetic
tree can represent the relationships and divergences between the taxa. However, taxa sequences
exhibit varying levels of conservation, e.g. due to regulatory elements and active binding sites,
and certain bacteria and viruses undergo interspecific recombination. We propose a phylogenetic factorial hidden Markov model to infer recombination and rate variation. We examined
the performance of our model and inference scheme on various synthetic alignments, and compared it to state-of-the-art breakpoint models. We investigated three DNA sequence alignments:
one of maize actin genes, one bacterial (Neisseria), and one of HIV-1. Inference is carried
out in the Bayesian framework, using Reversible Jump Markov Chain Monte Carlo