11,209 research outputs found

A Mutual Information Based Sequence Distance For Vertebrate Phylogeny Using Complete Mitochondrial Genomes

Author: Anh Vo
Mao Z
Yu Zuguo
Zhou Li-Qian
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2007

Traditional sequence distances require alignment. A new mutual information based sequence distance without alignment is defined in this paper. This distance is based on compositional vectors of DNA sequences or protein sequences from complete genomes. First we establish the mathematical foundation of this distance. Then this distance is applied to analyze the phylogenetic relationship of 64 vertebrates using complete mitochondrial genomes. The phylogenetic tree shows that the mitochondrial genomes are separated into three major groups. One group corresponds to mammals; one group corresponds to fish; and the last one is Archosauria (including birds and reptiles). The topology of the tree based on our new distance roughly agrees with the currently known phylogenies of vertebrates.

Crossref

Queensland University of Technology ePrints Archive

The similarity metric

Author: Chen Xin
Li Ming
Li Xin
Ma Bin
Vitanyi Paul
Publication date: 01/01/2003

A new class of distances appropriate for measuring similarity relations between sequences, say one type of similarity per distance, is studied. We propose a new "normalized information distance", based on the noncomputable notion of Kolmogorov complexity, and show that it is in this class and it minorizes every computable distance in the class (that is, it is universal in that it discovers all computable similarities). We demonstrate that it is a metric and call it the similarity metric. This theory forms the foundation for a new practical tool. To evidence generality and robustness we give two distinctive applications in widely divergent areas using standard compression programs like gzip and GenCompress. First, we compare whole mitochondrial genomes and infer their evolutionary history. This results in a first completely automatic computed whole mitochondrial phylogeny tree. Secondly, we fully automatically compute the language tree of 52 different languages. Comment: 13 pages, LaTeX, 5 figures. Part of this work appeared in Proc. 14th ACM-SIAM Symp. Discrete Algorithms, 2003. This is the final, corrected version to appear in IEEE Trans Inform. T

arXiv.org e-Print Archive

CiteSeerX

International Migration, Integration and Social Cohesion online publications

Phylogeny of Prokaryotes and Chloroplasts Revealed by a Simple Composition Approach on All Protein Sequences from Complete Genomes Without Sequence Alignment

Author: C Lemieux
CR Woese
D Sankoff
DH Moreira
E Chatton
E Mayr
E Pennisi
F Tekaia
WM Fitch
GI McFadden
GW Stuart
J Adachi
J De Las Rivas
J Lin
J Qi
J.Q. Deng
JA Eisen
JD Palmer
JR Brown
K.H. Chu
L.Q. Zhou
M Li
M Turmel
MA Ragan
MW Gray
N Saitou
O Weiss
RF Doolittle
RL Charlebois
RS Gupta
S.C. Long
ST Fitz-Gibbon
SV Edwards
V.V. Anh
VL Stirewalt
W Martin
Z.G. Yu
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2005

The complete genomes of living organisms have provided much information on their phylogenetic relationships. Similarly, the complete genomes of chloroplasts have helped to resolve the evolution of this organelle in photosynthetic eukaryotes. In this paper we propose an alternative method of phylogenetic analysis using compositional statistics for all protein sequences from complete genomes. This new method is conceptually simpler than and computationally as fast as the one proposed by Qi et al. (2004b) and Chu et al. (2004). The same data sets used in Qi et al. (2004b) and Chu et al. (2004) are analyzed using the new method. Our distance-based phylogenetic tree of the 109 prokaryotes and eukaryotes agrees with the biologists' tree of life based on 16S rRNA comparison in a predominant majority of basic branchings and most lower taxa. Our phylogenetic analysis also shows that the chloroplast genomes are separated into two major clades corresponding to chlorophytes s.l. and rhodophytes s.l. The interrelationships among the chloroplasts are largely in agreement with the current understanding of chloroplast evolution.

Crossref

Queensland University of Technology ePrints Archive

Clustering by compression

Author: Cilibrasi Rudi
Vitanyi Paul
Publication date: 09/04/2004

We present a new method for clustering based on compression. The method doesn't use subject-specific features or background knowledge, and works as follows: First, we determine a universal similarity distance, the normalized compression distance or NCD, computed from the lengths of compressed data files (singly and in pairwise concatenation). Second, we apply a hierarchical clustering method. The NCD is universal in that it is not restricted to a specific application area, and works across application area boundaries. A theoretical precursor, the normalized information distance, co-developed by one of the authors, is provably optimal but uses the non-computable notion of Kolmogorov complexity. We propose precise notions of similarity metric and normal compressor, and show that the NCD based on a normal compressor is a similarity metric that approximates universality. To extract a hierarchy of clusters from the distance matrix, we determine a dendrogram (binary tree) by a new quartet method and a fast heuristic to implement it. The method is implemented and available as public software, and is robust under choice of different compressors. To substantiate our claims of universality and robustness, we report evidence of successful application in areas as diverse as genomics, virology, languages, literature, music, handwritten digits, astronomy, and combinations of objects from completely different domains, using statistical, dictionary, and block sorting compressors. In genomics we presented new evidence for major questions in Mammalian evolution, based on whole-mitochondrial genomic analysis: the Eutherian orders and the Marsupionta hypothesis against the Theria hypothesis. Comment: LaTeX, 27 pages, 20 figures
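
As a rough illustration of the NCD described in this abstract, the sketch below computes NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed length. It assumes zlib as a stand-in compressor; the abstract mentions statistical, dictionary, and block sorting compressors without fixing one. The function names `compressed_size` and `ncd` are hypothetical and are not taken from the authors' software.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (assumed compressor)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance from single and concatenated sizes."""
    cx = compressed_size(x)
    cy = compressed_size(y)
    cxy = compressed_size(x + y)  # pairwise concatenation
    return (cxy - min(cx, cy)) / max(cx, cy)


if __name__ == "__main__":
    a = b"ACGT" * 200
    b = bytes(range(256))
    print("NCD(a, a) =", ncd(a, a))  # similar inputs: expected to be small
    print("NCD(a, b) =", ncd(a, b))  # unrelated inputs: expected near 1
```

With a real compressor the values are only approximations: NCD(x, x) is usually slightly above 0 rather than exactly 0, and results depend on the compressor and its window size.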

arXiv.org e-Print Archive

CiteSeerX

CWI's Institutional Repository

International Migration, Integration and Social Cohesion online publications

A New Quartet Tree Heuristic for Hierarchical Clustering

Author: Cilibrasi Rudi
Vitanyi Paul M. B.
Publication date: 01/01/2006

We consider the problem of constructing an optimal-weight tree from the 3*(n choose 4) weighted quartet topologies on n objects, where optimality means that the summed weight of the embedded quartet topologies is optimal (so it can be the case that the optimal tree embeds all quartets as non-optimal topologies). We present a heuristic for reconstructing the optimal-weight tree, and a canonical manner to derive the quartet-topology weights from a given distance matrix. The method repeatedly transforms a bifurcating tree, with all objects involved as leaves, achieving a monotonic approximation to the exact single globally optimal tree. This contrasts with other heuristic search methods from biological phylogeny, like DNAML or quartet puzzling, which repeatedly and incrementally construct a solution from a random order of objects, and subsequently add agreement values. Comment: 22 pages, 14 figures
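
To make the count of 3*(n choose 4) quartet topologies concrete, the sketch below enumerates them for a small distance matrix and attaches a cost to each. The cost rule used here, d(a,b) + d(c,d) for the topology ab|cd, is an assumption made for illustration; the abstract only says the weights are derived from the distance matrix in a canonical manner. The names `quartet_costs` and `best_topology` are hypothetical, and the tree-search heuristic itself is not shown.

```python
from itertools import combinations
from math import comb


def quartet_costs(dist):
    """Map every quartet topology ab|cd to an assumed cost d(a,b) + d(c,d)."""
    costs = {}
    for u, v, w, x in combinations(range(len(dist)), 4):
        # Four leaves can be split into two pairs in exactly three ways.
        for (a, b), (c, d) in (((u, v), (w, x)), ((u, w), (v, x)), ((u, x), (v, w))):
            costs[((a, b), (c, d))] = dist[a][b] + dist[c][d]
    return costs


def best_topology(costs, quartet):
    """Return the lowest-cost pairing among the three for one quartet."""
    u, v, w, x = quartet
    options = (((u, v), (w, x)), ((u, w), (v, x)), ((u, x), (v, w)))
    return min(options, key=lambda t: costs[t])


if __name__ == "__main__":
    # Toy symmetric distances: objects 0,1 are close, and 2,3 are close.
    dist = [
        [0, 1, 4, 4, 4],
        [1, 0, 4, 4, 4],
        [4, 4, 0, 1, 4],
        [4, 4, 1, 0, 4],
        [4, 4, 4, 4, 0],
    ]
    costs = quartet_costs(dist)
    print(len(costs), "topologies; formula gives", 3 * comb(len(dist), 4))
    print("best pairing for (0, 1, 2, 3):", best_topology(costs, (0, 1, 2, 3)))
```

The number of topologies grows as n^4, which is one reason a heuristic search over trees is needed instead of exhaustive evaluation.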

arXiv.org e-Print Archive

CiteSeerX

Dagstuhl Research Online Publication Server

Mapping the Space of Genomic Signatures

Author: Bryans Nathaniel
Dattani Nikesh S.
Davis Katelyn
Hill Kathleen A.
Karamichalis Rallis
Kari Lila
Sayem Abu S.
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 09/10/2014

We propose a computational method to measure and visualize interrelationships among any number of DNA sequences allowing, for example, the examination of hundreds or thousands of complete mitochondrial genomes. An "image distance" is computed for each pair of graphical representations of DNA sequences, and the distances are visualized as a Molecular Distance Map: Each point on the map represents a DNA sequence, and the spatial proximity between any two points reflects the degree of structural similarity between the corresponding sequences. The graphical representation of DNA sequences utilized, Chaos Game Representation (CGR), is genome- and species-specific and can thus act as a genomic signature. Consequently, Molecular Distance Maps could inform species identification, taxonomic classifications and, to a certain extent, evolutionary history. The image distance employed, Structural Dissimilarity Index (DSSIM), implicitly compares the occurrences of oligomers of length up to k (herein k=9) in DNA sequences. We computed DSSIM distances for more than 5 million pairs of complete mitochondrial genomes, and used Multi-Dimensional Scaling (MDS) to obtain Molecular Distance Maps that visually display the sequence relatedness in various subsets, at different taxonomic levels. This general-purpose method does not require DNA sequence homology and can thus be used to compare similar or vastly different DNA sequences, genomic or computer-generated, of the same or different lengths. We illustrate potential uses of this approach by applying it to several taxonomic subsets: phylum Vertebrata, (super)kingdom Protista, classes Amphibia-Insecta-Mammalia, class Amphibia, and order Primates. This analysis of an extensive dataset confirms that the oligomer composition of full mtDNA sequences can be a source of taxonomic information. Comment: 14 pages, 7 figures. arXiv admin note: substantial text overlap with arXiv:1307.375
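
The Chaos Game Representation mentioned in this abstract can be sketched in a few lines: each nucleotide moves the current point halfway toward that nucleotide's corner of the unit square, and binning the points on a 2^k x 2^k grid counts oligomers of length k. The corner assignment below is one common convention and is an assumption, as are the names `cgr_points` and `fcgr`. The DSSIM image distance and the MDS step are not implemented here.

```python
# Assumed corner layout (a common convention, not confirmed by the abstract).
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(seq):
    """Return the CGR trajectory: one (x, y) point per unambiguous base."""
    x, y = 0.5, 0.5  # start at the centre of the unit square
    points = []
    for base in seq.upper():
        if base not in CORNERS:
            continue  # skip ambiguous symbols such as N in this sketch
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2, (y + cy) / 2  # move halfway to the corner
        points.append((x, y))
    return points


def fcgr(seq, k):
    """Bin CGR points on a 2^k x 2^k grid; each cell counts one k-mer."""
    size = 2 ** k
    grid = [[0] * size for _ in range(size)]
    # The point reached after at least k bases depends on the last k bases.
    for x, y in cgr_points(seq)[k - 1:]:
        col = min(int(x * size), size - 1)
        row = min(int(y * size), size - 1)
        grid[row][col] += 1
    return grid


if __name__ == "__main__":
    grid = fcgr("ACGTACGTTTGA", 2)
    for row in grid:
        print(row)
```

Floating-point coordinates are adequate for a sketch like this, but a production implementation would more likely index k-mers directly to avoid rounding effects in long homopolymer runs.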

arXiv.org e-Print Archive

Directory of Open Access Journals

The Echinococcus canadensis (G7) genome: A key knowledge of parasitic platyhelminth human diseases

Author: A Bankevich
A Gurevich
A Lomsadze
Adolfo Fox
AM Bolger
Anna C. M. Salim
B Hendrich
B Langmead
C Bermudez-Santana
C Hahn
C Holt
C Jiang
C Trapnell
CA Alvarez Rojas
CCM Budke
D Kim
D Takai
DP McManus
DR Zerbino
E Elkayam
E Keibler
E Quevillon
F Jeanmougin
F Kiefer
F Mohn
Federico Camicia
Flávio M. Gomes Araújo
G Abrusán
G Parra
GSC Slater
Guilherme Oliveira
H Li
H Zheng
I Korf
IJ Tsai
J Eckert
JK Nono
JM Bart
JP Hewitson
Juliana Assis
K Arnold
K Matsuo
K Thivierge
K Wasik
KJ Fryxell
KK Geyer
L Han
L Kamenetzky
L Li
LA Kelley
Laura Kamenetzky
LD Moore
Lucas L. Maldonado
M Ashburner
M Biasini
M Cucher
M Krzywinski
M Marín
M Nakao
M Rosenzvit
M Sajid
M Stanke
MA Cucher
Mara Rosenzvit
Marcela Cucher
MC Rosenzvit
MW Robinson
N Guex
N Macchiaroli
N Schürmann
Natalia Macchiaroli
ND Young
O Bogdanović
P Carninci
P Cingolani
P Danecek
PM Muzulin
PM Schantz
PS Craig
R Luo
R Schneider
RD Finn
RJ Klose
S Assefa
S Maillard
S Saxonov
S Yi
SF Altschul
SM Sadjjadi
TD Otto
TM Lowe
U Koziol
U Saarma
W Pan
Y Moriya
Y Safonova
YA Medvedeva
Z Zhao
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/02/2017

Background: The parasite Echinococcus canadensis (G7) (phylum Platyhelminthes, class Cestoda) is one of the causative agents of echinococcosis. Echinococcosis is a worldwide chronic zoonosis affecting humans as well as domestic and wild mammals, which has been reported as a prioritized neglected disease by the World Health Organisation. No genomic data, comparative genomic analyses or efficient therapeutic and diagnostic tools are available for this severe disease. The information presented in this study will help to understand the peculiar biological characters and to design species-specific control tools. Results: We sequenced, assembled and annotated the 115-Mb genome of E. canadensis (G7). Comparative genomic analyses using whole genome data of three Echinococcus species not only confirmed the status of E. canadensis (G7) as a separate species but also demonstrated a high nucleotide sequence divergence in relation to E. granulosus (G1). The E. canadensis (G7) genome contains 11,449 genes with a core set of 881 orthologs shared among five cestode species. Comparative genomics revealed that there are more single nucleotide polymorphisms (SNPs) between E. canadensis (G7) and E. granulosus (G1) than between E. canadensis (G7) and E. multilocularis. This result was unexpected since E. canadensis (G7) and E. granulosus (G1) were considered to belong to the species complex E. granulosus sensu lato. We described SNPs in known drug targets and metabolism genes in the E. canadensis (G7) genome. Regarding gene regulation, we analysed three particular features: CpG island distribution along the three Echinococcus genomes, DNA methylation system and small RNA pathway. The results suggest the occurrence of yet unknown gene regulation mechanisms in Echinococcus. Conclusions: This is the first work that addresses Echinococcus comparative genomics. The resources presented here will promote the study of mechanisms of parasite development as well as new tools for drug discovery.
The availability of a high-quality genome assembly is critical for fully exploring the biology of a pathogenic organism. The E. canadensis (G7) genome presented in this study provides a unique opportunity to address the genetic diversity among the genus Echinococcus and its particular developmental features. At present, there is no unequivocal taxonomic classification of Echinococcus species; however, the genome-wide SNP analysis performed here revealed the phylogenetic distance among these three Echinococcus species. Additional cestode genomes need to be sequenced to be able to resolve their phylogeny.
Affiliation: Maldonado, Lucas Luciano. Consejo Nacional de Investigaciones Científicas y Técnicas. Oficina de Coordinación Administrativa Houssay. Instituto de Investigaciones en Microbiología y Parasitología Médica. Universidad de Buenos Aires. Facultad de Medicina. Instituto de Investigaciones en Microbiología y Parasitología Médica; Argentina
Affiliation: Assis, Juliana. Fundación Oswaldo Cruz; Brazil
Affiliation: Gomes Araújo, Flávio M. Fundación Oswaldo Cruz; Brazil
Affiliation: Salim, Anna C. M. Fundación Oswaldo Cruz; Brazil
Affiliation: Macchiaroli, Natalia. Consejo Nacional de Investigaciones Científicas y Técnicas. Oficina de Coordinación Administrativa Houssay. Instituto de Investigaciones en Microbiología y Parasitología Médica. Universidad de Buenos Aires. Facultad de Medicina. Instituto de Investigaciones en Microbiología y Parasitología Médica; Argentina
Affiliation: Cucher, Marcela Alejandra. Consejo Nacional de Investigaciones Científicas y Técnicas. Oficina de Coordinación Administrativa Houssay. Instituto de Investigaciones en Microbiología y Parasitología Médica. Universidad de Buenos Aires. Facultad de Medicina. Instituto de Investigaciones en Microbiología y Parasitología Médica; Argentina
Affiliation: Camicia, Federico. Consejo Nacional de Investigaciones Científicas y Técnicas. Oficina de Coordinación Administrativa Houssay. Instituto de Investigaciones en Microbiología y Parasitología Médica. Universidad de Buenos Aires. Facultad de Medicina. Instituto de Investigaciones en Microbiología y Parasitología Médica; Argentina
Affiliation: Fox, Adolfo. Consejo Nacional de Investigaciones Científicas y Técnicas. Oficina de Coordinación Administrativa Houssay. Instituto de Investigaciones en Microbiología y Parasitología Médica. Universidad de Buenos Aires. Facultad de Medicina. Instituto de Investigaciones en Microbiología y Parasitología Médica; Argentina
Affiliation: Rosenzvit, Mara Cecilia. Consejo Nacional de Investigaciones Científicas y Técnicas. Oficina de Coordinación Administrativa Houssay. Instituto de Investigaciones en Microbiología y Parasitología Médica. Universidad de Buenos Aires. Facultad de Medicina. Instituto de Investigaciones en Microbiología y Parasitología Médica; Argentina
Affiliation: Oliveira, Guilherme. Instituto Tecnológico Vale; Brazil. Fundación Oswaldo Cruz; Brazil
Affiliation: Kamenetzky, Laura. Consejo Nacional de Investigaciones Científicas y Técnicas. Oficina de Coordinación Administrativa Houssay. Instituto de Investigaciones en Microbiología y Parasitología Médica. Universidad de Buenos Aires. Facultad de Medicina. Instituto de Investigaciones en Microbiología y Parasitología Médica; Argentina

Accounting for molecular stochasticity in systematic revisions: species limits and phylogeny of Paroaria

Author: Ana L. Porzecanski
Liliana M. Dá
Publication date: 22/06/2009

Different frameworks have been proposed for using molecular data in systematic revisions, but there is ongoing debate on their applicability, merits and shortcomings. In this paper we examine the fit between morphological and molecular data in the systematic revision of Paroaria, a group of conspicuous songbirds endemic to South America. We delimited species based on examination of > 600 specimens, and developed distance-gap, and distance- and character-based coalescent simulations to test species limits with molecular data. The morphological and molecular data collected were then analyzed using parsimony, maximum likelihood, and Bayesian phylogenetics. The simulations were better at evaluating the new species limits than using genetic distances. Species diversity within Paroaria had been underestimated by 60%, and the revised genus comprises eight species. Phylogenetic analyses consistently recovered a congruent topology for the most recently derived species in the genus, but the most basal divergences were not resolved with these data. The systematic and phylogenetic hypotheses developed here are relevant to both setting conservation priorities and understanding the biogeography of South America.

Nature Precedings

Mitochondrial DNA lineages of Italian Giara and Sarcidano horses

Author: Barbato M.
Cancedda M.
Contu D.
Francalacci P.
Morelli L.
Pala Maria
Sanna D.
Useli A.
Publication venue: 'Genetics and Molecular Research'
Publication date: 01/01/2014

Giara and Sarcidano are 2 of the 15 extant native Italian horse breeds with limited dispersal capability that originated from a larger number of individuals. The 2 breeds live in 2 distinct isolated locations on the island of Sardinia. To determine the genetic structure and evolutionary history of these 2 Sardinian breeds, the first hypervariable segment of the mitochondrial DNA (mtDNA) was sequenced and analyzed in 40 Giara and Sarcidano horses and compared with publicly available mtDNA data from 43 Old World breeds. Four different analyses, including genetic distance, analysis of molecular variance, haplotype sharing, and clustering methods, were used to study the genetic relationships between the Sardinian and other horse breeds. The analyses yielded similar results, and the FST values indicated that a high percentage of the total genetic variation was explained by between-breed differences. Consistent with their distinct phenotypes and geographic isolation, the 2 Sardinian breeds were shown to consist of 2 distinct gene pools that had no gene flow between them. Giara horses were clearly separated from the other breeds examined and showed traces of ancient separation from horses of other breeds that share the same mitochondrial lineage. On the other hand, the data from the Sarcidano horses fit well with variation among breeds from the Iberian Peninsula and North-West Europe: genetic relationships among Sarcidano and the other breeds are consistent with the documented history of this breed.

Crossref

PubliCatt

Archivio istituzionale della ricerca - Università di Cagliari

University of Huddersfield Repository

Huddersfield Research Portal