A Mutual Information Based Sequence Distance For Vertebrate Phylogeny Using Complete Mitochondrial Genomes
Traditional sequence distances require alignment. This paper defines a new alignment-free sequence distance based on mutual information. The distance is computed from compositional vectors of DNA or protein sequences from complete genomes. First we establish the mathematical foundation of this distance. Then we apply it to analyze the phylogenetic relationships of 64 vertebrates using complete mitochondrial genomes. The phylogenetic tree shows that the mitochondrial genomes are separated into three major groups. One group corresponds to mammals; one group corresponds to fish; and the last one is Archosauria (including birds and reptiles). The topology of the tree based on our new distance roughly agrees with the currently known phylogenies of vertebrates.
The similarity metric
A new class of distances appropriate for measuring similarity relations
between sequences, say one type of similarity per distance, is studied. We
propose a new ``normalized information distance'', based on the noncomputable
notion of Kolmogorov complexity, and show that it is in this class and it
minorizes every computable distance in the class (that is, it is universal in
that it discovers all computable similarities). We demonstrate that it is a
metric and call it the {\em similarity metric}. This theory forms the
foundation for a new practical tool. To evidence generality and robustness we
give two distinctive applications in widely divergent areas using standard
compression programs like gzip and GenCompress. First, we compare whole
mitochondrial genomes and infer their evolutionary history. This results in a
first completely automatically computed whole mitochondrial phylogeny tree.
Secondly, we fully automatically compute the language tree of 52 different
languages.
Comment: 13 pages, LaTeX, 5 figures. Part of this work appeared in Proc. 14th
ACM-SIAM Symp. Discrete Algorithms, 2003. This is the final, corrected,
version to appear in IEEE Trans Inform. T
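For readers who want the quantity behind this abstract, the normalized information distance is usually written as below. This is a sketch of the commonly cited form, not a quotation from the abstract: the symbol NID and the notation K(x | y) for the Kolmogorov complexity of x given y are assumptions introduced here, and the identity is understood to hold only up to small additive (logarithmic) terms.

```latex
\[
  \mathrm{NID}(x, y) \;=\;
  \frac{\max\{\, K(x \mid y),\; K(y \mid x) \,\}}
       {\max\{\, K(x),\; K(y) \,\}}
\]
```

Because K is noncomputable, practical use replaces it with the output length of a real compressor such as gzip or GenCompress, as the abstract notes.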
Phylogeny of Prokaryotes and Chloroplasts Revealed by a Simple Composition Approach on All Protein Sequences from Complete Genomes Without Sequence Alignment
The complete genomes of living organisms have provided much information on their phylogenetic relationships. Similarly, the complete genomes of chloroplasts have helped to resolve the evolution of this organelle in photosynthetic eukaryotes. In this paper we propose an alternative method of phylogenetic analysis using compositional statistics for all protein sequences from complete genomes. This new method is conceptually simpler than and computationally as fast as the one proposed by Qi et al. (2004b) and Chu et al. (2004). The same data sets used in Qi et al. (2004b) and Chu et al. (2004) are analyzed using the new method. Our distance-based phylogenetic tree of the 109 prokaryotes and eukaryotes agrees with the biologists' tree of life based on 16S rRNA comparison in a predominant majority of basic branching and most lower taxa. Our phylogenetic analysis also shows that the chloroplast genomes are separated into two major clades corresponding to chlorophytes s.l. and rhodophytes s.l. The interrelationships among the chloroplasts are largely in agreement with the current understanding of chloroplast evolution.
Clustering by compression
We present a new method for clustering based on compression. The method
doesn't use subject-specific features or background knowledge, and works as
follows: First, we determine a universal similarity distance, the normalized
compression distance or NCD, computed from the lengths of compressed data files
(singly and in pairwise concatenation). Second, we apply a hierarchical
clustering method. The NCD is universal in that it is not restricted to a
specific application area, and works across application area boundaries. A
theoretical precursor, the normalized information distance, co-developed by one
of the authors, is provably optimal but uses the non-computable notion of
Kolmogorov complexity. We propose precise notions of similarity metric, normal
compressor, and show that the NCD based on a normal compressor is a similarity
metric that approximates universality. To extract a hierarchy of clusters from
the distance matrix, we determine a dendrogram (binary tree) by a new quartet
method and a fast heuristic to implement it. The method is implemented and
available as public software, and is robust under choice of different
compressors. To substantiate our claims of universality and robustness, we
report evidence of successful application in areas as diverse as genomics,
virology, languages, literature, music, handwritten digits, astronomy, and
combinations of objects from completely different domains, using statistical,
dictionary, and block sorting compressors. In genomics we presented new
evidence for major questions in Mammalian evolution, based on
whole-mitochondrial genomic analysis: the Eutherian orders and the Marsupionta
hypothesis against the Theria hypothesis.
Comment: LaTeX, 27 pages, 20 figure
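The normalized compression distance described in this abstract can be sketched in a few lines. The example below is a minimal illustration, assuming Python's standard `zlib` as a stand-in compressor (the authors' public software is not used here) and the usual form NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed length; the function names are hypothetical.

```python
import zlib


def compressed_len(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (assumed stand-in for C)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance from single and concatenated lengths."""
    cx = compressed_len(x)
    cy = compressed_len(y)
    cxy = compressed_len(x + y)  # pairwise concatenation
    return (cxy - min(cx, cy)) / max(cx, cy)


def distance_matrix(items: list[bytes]) -> list[list[float]]:
    """Pairwise NCD matrix that a hierarchical clustering step could consume."""
    return [[ncd(a, b) for b in items] for a in items]
```

Values near 0 suggest that one input compresses well given the other, and values near 1 suggest little shared structure. With a real compressor the result is only approximately symmetric and can slightly exceed 1.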
A New Quartet Tree Heuristic for Hierarchical Clustering
We consider the problem of constructing an optimal-weight tree from the
3*(n choose 4) weighted quartet topologies on n objects, where optimality means
that the summed weight of the embedded quartet topologies is optimal (so it can
be the case that the optimal tree embeds all quartets as non-optimal
topologies). We present a heuristic for reconstructing the optimal-weight tree,
and a canonical manner to derive the quartet-topology weights from a given
distance matrix. The method repeatedly transforms a bifurcating tree, with all
objects involved as leaves, achieving a monotonic approximation to the exact
single globally optimal tree. This contrasts to other heuristic search methods
from biological phylogeny, like DNAML or quartet puzzling, which, repeatedly,
incrementally construct a solution from a random order of objects, and
subsequently add agreement values.
Comment: 22 pages, 14 figure
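To make the count 3*(n choose 4) concrete, the sketch below enumerates the three pairings of every four-object subset. The cost function is an assumption for illustration only: it scores a topology uv|wx as d(u, v) + d(w, x) from a distance matrix, which is one simple way to derive weights and is not confirmed by the abstract. All names are hypothetical.

```python
from itertools import combinations
from math import comb


def quartet_topologies(labels):
    """Yield the three topologies ab|cd, ac|bd, ad|bc for each 4-subset."""
    for a, b, c, d in combinations(labels, 4):
        yield ((a, b), (c, d))
        yield ((a, c), (b, d))
        yield ((a, d), (b, c))


def topology_cost(dist, topology):
    """Assumed cost: sum of the distances within the two sibling pairs."""
    (u, v), (w, x) = topology
    return dist[u][v] + dist[w][x]


def count_topologies(n: int) -> int:
    """Closed-form count of quartet topologies on n objects."""
    return 3 * comb(n, 4)
```

For example, six objects give 3 * 15 = 45 topologies, which is the number of items produced by `quartet_topologies(range(6))`.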
Mapping the Space of Genomic Signatures
We propose a computational method to measure and visualize interrelationships
among any number of DNA sequences allowing, for example, the examination of
hundreds or thousands of complete mitochondrial genomes. An "image distance" is
computed for each pair of graphical representations of DNA sequences, and the
distances are visualized as a Molecular Distance Map: Each point on the map
represents a DNA sequence, and the spatial proximity between any two points
reflects the degree of structural similarity between the corresponding
sequences. The graphical representation of DNA sequences utilized, Chaos Game
Representation (CGR), is genome- and species-specific and can thus act as a
genomic signature. Consequently, Molecular Distance Maps could inform species
identification, taxonomic classifications and, to a certain extent,
evolutionary history. The image distance employed, Structural Dissimilarity
Index (DSSIM), implicitly compares the occurrences of oligomers of length up to
(herein ) in DNA sequences. We computed DSSIM distances for more than
5 million pairs of complete mitochondrial genomes, and used Multi-Dimensional
Scaling (MDS) to obtain Molecular Distance Maps that visually display the
sequence relatedness in various subsets, at different taxonomic levels. This
general-purpose method does not require DNA sequence homology and can thus be
used to compare similar or vastly different DNA sequences, genomic or
computer-generated, of the same or different lengths. We illustrate potential
uses of this approach by applying it to several taxonomic subsets: phylum
Vertebrata, (super)kingdom Protista, classes Amphibia-Insecta-Mammalia, class
Amphibia, and order Primates. This analysis of an extensive dataset confirms
that the oligomer composition of full mtDNA sequences can be a source of
taxonomic information.
Comment: 14 pages, 7 figures. arXiv admin note: substantial text overlap with
arXiv:1307.375
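The Chaos Game Representation mentioned in this abstract can be illustrated as follows. This is a minimal sketch under stated assumptions: the corner assignment (A, C, G, T at the four corners of the unit square) is one common convention and may differ from the authors', the grid resolution 2^k is a free parameter, and the DSSIM image distance itself is not implemented. Function names are hypothetical.

```python
# Assumed corner layout; other conventions permute these corners.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(seq: str):
    """Walk the chaos game: move halfway toward the corner of each base."""
    x, y = 0.5, 0.5  # start at the center of the unit square
    points = []
    for base in seq.upper():
        if base not in CORNERS:  # skip ambiguous symbols such as N
            continue
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2, (y + cy) / 2
        points.append((x, y))
    return points


def cgr_grid(seq: str, k: int):
    """Bin CGR points into a 2^k by 2^k count grid (a grayscale image)."""
    size = 2 ** k
    grid = [[0] * size for _ in range(size)]
    for x, y in cgr_points(seq):
        col = min(int(x * size), size - 1)
        row = min(int(y * size), size - 1)
        grid[row][col] += 1
    return grid
```

Each cell of the grid corresponds to one length-k suffix of the sequence read so far, so the image summarizes oligomer occurrences; two such grids could then be compared with an image distance.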
The Echinococcus canadensis (G7) genome: A key knowledge of parasitic platyhelminth human diseases
Background: The parasite Echinococcus canadensis (G7) (phylum Platyhelminthes, class Cestoda) is one of the causative agents of echinococcosis. Echinococcosis is a worldwide chronic zoonosis affecting humans as well as domestic and wild mammals, which has been reported as a prioritized neglected disease by the World Health Organisation. No genomic data, comparative genomic analyses or efficient therapeutic and diagnostic tools are available for this severe disease. The information presented in this study will help to understand the peculiar biological characters and to design species-specific control tools. Results: We sequenced, assembled and annotated the 115-Mb genome of E. canadensis (G7). Comparative genomic analyses using whole genome data of three Echinococcus species not only confirmed the status of E. canadensis (G7) as a separate species but also demonstrated a high nucleotide sequence divergence in relation to E. granulosus (G1). The E. canadensis (G7) genome contains 11,449 genes with a core set of 881 orthologs shared among five cestode species. Comparative genomics revealed that there are more single nucleotide polymorphisms (SNPs) between E. canadensis (G7) and E. granulosus (G1) than between E. canadensis (G7) and E. multilocularis. This result was unexpected since E. canadensis (G7) and E. granulosus (G1) were considered to belong to the species complex E. granulosus sensu lato. We described SNPs in known drug targets and metabolism genes in the E. canadensis (G7) genome. Regarding gene regulation, we analysed three particular features: CpG island distribution along the three Echinococcus genomes, DNA methylation system and small RNA pathway. The results suggest the occurrence of yet unknown gene regulation mechanisms in Echinococcus. Conclusions: This is the first work that addresses Echinococcus comparative genomics. The resources presented here will promote the study of mechanisms of parasite development as well as new tools for drug discovery.
The availability of a high-quality genome assembly is critical for fully exploring the biology of a pathogenic organism. The E. canadensis (G7) genome presented in this study provides a unique opportunity to address the genetic diversity among the genus Echinococcus and its particular developmental features. At present, there is no unequivocal taxonomic classification of Echinococcus species; however, the genome-wide SNPs analysis performed here revealed the phylogenetic distance among these three Echinococcus species. Additional cestode genomes need to be sequenced to be able to resolve their phylogeny.
Fil: Maldonado, Lucas Luciano. Consejo Nacional de Investigaciones Científicas y Técnicas. Oficina de Coordinación Administrativa Houssay. Instituto de Investigaciones en Microbiología y Parasitología Médica. Universidad de Buenos Aires. Facultad de Medicina. Instituto de Investigaciones en Microbiología y Parasitología Médica; Argentina
Fil: Assis, Juliana. Fundación Oswaldo Cruz; Brasil
Fil: Gomes Araújo, Flávio M.. Fundación Oswaldo Cruz; Brasil
Fil: Salim, Anna C. M.. Fundación Oswaldo Cruz; Brasil
Fil: Macchiaroli, Natalia. Consejo Nacional de Investigaciones Científicas y Técnicas. Oficina de Coordinación Administrativa Houssay. Instituto de Investigaciones en Microbiología y Parasitología Médica. Universidad de Buenos Aires. Facultad de Medicina. Instituto de Investigaciones en Microbiología y Parasitología Médica; Argentina
Fil: Cucher, Marcela Alejandra. Consejo Nacional de Investigaciones Científicas y Técnicas. Oficina de Coordinación Administrativa Houssay. Instituto de Investigaciones en Microbiología y Parasitología Médica. Universidad de Buenos Aires. Facultad de Medicina. Instituto de Investigaciones en Microbiología y Parasitología Médica; Argentina
Fil: Camicia, Federico. Consejo Nacional de Investigaciones Científicas y Técnicas. Oficina de Coordinación Administrativa Houssay. Instituto de Investigaciones en Microbiología y Parasitología Médica. Universidad de Buenos Aires. Facultad de Medicina. Instituto de Investigaciones en Microbiología y Parasitología Médica; Argentina
Fil: Fox, Adolfo. Consejo Nacional de Investigaciones Científicas y Técnicas. Oficina de Coordinación Administrativa Houssay. Instituto de Investigaciones en Microbiología y Parasitología Médica. Universidad de Buenos Aires. Facultad de Medicina. Instituto de Investigaciones en Microbiología y Parasitología Médica; Argentina
Fil: Rosenzvit, Mara Cecilia. Consejo Nacional de Investigaciones Científicas y Técnicas. Oficina de Coordinación Administrativa Houssay. Instituto de Investigaciones en Microbiología y Parasitología Médica. Universidad de Buenos Aires. Facultad de Medicina. Instituto de Investigaciones en Microbiología y Parasitología Médica; Argentina
Fil: Oliveira, Guilherme. Instituto Tecnológico Vale; Brasil. Fundación Oswaldo Cruz; Brasil
Fil: Kamenetzky, Laura. Consejo Nacional de Investigaciones Científicas y Técnicas. Oficina de Coordinación Administrativa Houssay. Instituto de Investigaciones en Microbiología y Parasitología Médica. Universidad de Buenos Aires. Facultad de Medicina. Instituto de Investigaciones en Microbiología y Parasitología Médica; Argentin
Accounting for molecular stochasticity in systematic revisions: species limits and phylogeny of Paroaria
Different frameworks have been proposed for using molecular data in systematic revisions, but there is ongoing debate on their applicability, merits and shortcomings. In this paper we examine the fit between morphological and molecular data in the systematic revision of Paroaria, a group of conspicuous songbirds endemic to South America. We delimited species based on examination of > 600 specimens, and developed distance-gap, and distance- and character-based coalescent simulations to test species limits with molecular data. The morphological and molecular data collected were then analyzed using parsimony, maximum likelihood, and Bayesian phylogenetics. The simulations were better at evaluating the new species limits than using genetic distances. Species diversity within Paroaria had been underestimated by 60%, and the revised genus comprises eight species. Phylogenetic analyses consistently recovered a congruent topology for the most recently derived species in the genus, but the most basal divergences were not resolved with these data. The systematic and phylogenetic hypotheses developed here are relevant to both setting conservation priorities and understanding the biogeography of South America. 

Mitochondrial DNA lineages of Italian Giara and Sarcidano horses
Giara and Sarcidano are 2 of the 15 extant native Italian horse breeds with limited dispersal capability that originated from a larger number of individuals. The 2 breeds live in two distinct isolated locations on the island of Sardinia. To determine the genetic structure and evolutionary history of these 2 Sardinian breeds, the first hypervariable segment of the mitochondrial DNA (mtDNA) was sequenced and analyzed in 40 Giara and Sarcidano horses and compared with publicly available mtDNA data from 43 Old World breeds. Four different analyses, including genetic distance, analysis of molecular variance, haplotype sharing, and clustering methods, were used to study the genetic relationships between the Sardinian and other horse breeds. The analyses yielded similar results, and the FST values indicated that a high percentage of the total genetic variation was explained by between-breed differences. Consistent with their distinct phenotypes and geographic isolation, the two Sardinian breeds were shown to consist of 2 distinct gene pools that had no gene flow between them. Giara horses were clearly separated from the other breeds examined and showed traces of ancient separation from horses of other breeds that share the same mitochondrial lineage. On the other hand, the data from the Sarcidano horses fit well with variation among breeds from the Iberian Peninsula and North-West Europe: genetic relationships among Sarcidano and the other breeds are consistent with the documented history of this breed