3,897 research outputs found

    Reconstructing Biological and Digital Phylogenetic Trees in Parallel

    Get PDF
    In this paper, we study the parallel query complexity of reconstructing biological and digital phylogenetic trees from simple queries involving their nodes. This is motivated from computational biology, data protection, and computer security settings, which can be abstracted in terms of two parties, a responder, Alice, who must correctly answer queries of a given type regarding a degree-d tree, T, and a querier, Bob, who issues batches of queries, with each query in a batch being independent of the others, so as to eventually infer the structure of T. We show that a querier can efficiently reconstruct an n-node degree-d tree, T, with a logarithmic number of rounds and quasilinear number of queries, with high probability, for various types of queries, including relative-distance queries and path queries. Our results are all asymptotically optimal and improve the asymptotic (sequential) query complexity for one of the problems we study. Moreover, through an experimental analysis using both real-world and synthetic data, we provide empirical evidence that our algorithms provide significant parallel speedups while also improving the total query complexities for the problems we study

    An entropy based heuristic model for predicting functional sub-type divisions of protein families

    Get PDF
    Multiple sequence alignments of protein families are often used for locating residues that are widely apart in the sequence, which are considered as influential for determining functional specificity of proteins towards various substrates, ligands, DNA and other proteins. In this paper, we propose an entropy-score based heuristic algorithm model for predicting functional sub-family divisions of protein families, given the multiple sequence alignment of the protein family as input without any functional sub-type or key site information given for any protein sequence. Two of the experimented test-cases are reported in this paper. First test-case is Nucleotidyl Cyclase protein family consisting of guanalyate and adenylate cyclases. And the second test-case is a dataset of proteins taken from six superfamilies in Structure-Function Linkage Database (SFLD). Results from these test-cases are reported in terms of confirmed sub-type divisions with phylogeny relations from former studies in the literature

    Multivariate Approaches to Classification in Extragalactic Astronomy

    Get PDF
    Clustering objects into synthetic groups is a natural activity of any science. Astrophysics is not an exception and is now facing a deluge of data. For galaxies, the one-century old Hubble classification and the Hubble tuning fork are still largely in use, together with numerous mono-or bivariate classifications most often made by eye. However, a classification must be driven by the data, and sophisticated multivariate statistical tools are used more and more often. In this paper we review these different approaches in order to situate them in the general context of unsupervised and supervised learning. We insist on the astrophysical outcomes of these studies to show that multivariate analyses provide an obvious path toward a renewal of our classification of galaxies and are invaluable tools to investigate the physics and evolution of galaxies.Comment: Open Access paper. http://www.frontiersin.org/milky\_way\_and\_galaxies/10.3389/fspas.2015.00003/abstract\>. \<10.3389/fspas.2015.00003 \&g

    A target enrichment method for gathering phylogenetic information from hundreds of loci: An example from the Compositae.

    Get PDF
    UnlabelledPremise of the studyThe Compositae (Asteraceae) are a large and diverse family of plants, and the most comprehensive phylogeny to date is a meta-tree based on 10 chloroplast loci that has several major unresolved nodes. We describe the development of an approach that enables the rapid sequencing of large numbers of orthologous nuclear loci to facilitate efficient phylogenomic analyses. •Methods and resultsWe designed a set of sequence capture probes that target conserved orthologous sequences in the Compositae. We also developed a bioinformatic and phylogenetic workflow for processing and analyzing the resulting data. Application of our approach to 15 species from across the Compositae resulted in the production of phylogenetically informative sequence data from 763 loci and the successful reconstruction of known phylogenetic relationships across the family. •ConclusionsThese methods should be of great use to members of the broader Compositae community, and the general approach should also be of use to researchers studying other families

    A New Genus of Miniaturized and Pug-Nosed Gecko from South America (Sphaerodactylidae: Gekkota)

    Get PDF
    Sphaerodactyl geckos comprise five genera distributed across Central and South America and the Caribbean. We estimated phylogenetic relationships among sphaerodactyl genera using both separate and combined analyses of seven nuclear genes. Relationships among genera were incongruent at different loci and phylogenies were characterized by short, in some cases zero-length, internal branches and poor phylogenetic support at most nodes. We recovered a polyphyletic Coleodactylus, with Coleodactylus amazonicus being deeply divergent from the remaining Coleodactylus species sampled. The C. amazonicus lineage possessed unique codon deletions in the genes PTPN12 and RBMX while the remaining Coleodactylus species had unique codon deletions in RAG1. Topology tests could not reject a monophyletic Coleodactylus, but we show that short internal branch lengths decreased the accuracy of topology tests because there were not enough data along these short branches to support one phylogenetic hypothesis over another. Morphological data corroborated results of the molecular phylogeny, with Coleodactylus exhibiting substantial morphological heterogeneity. We identified a suite of unique craniofacial features that differentiate C. amazonicus not only from other Coleodactylus species, but also from all other geckos. We describe this novel sphaerodactyl lineage as a new genus, Chatogekko gen. nov. We present a detailed osteology of Chatogekko, characterizing osteological correlates of miniaturization that provide a framework for future studies in sphaerodactyl systematics and biology

    Computational phylogenetics and the classification of South American languages

    Get PDF
    In recent years, South Americanist linguists have embraced computational phylogenetic methods to resolve the numerous outstanding questions about the genealogi- cal relationships among the languages of the continent. We provide a critical review of the methods and language classification results that have accumulated thus far, emphasizing the superiority of character-based methods over distance-based ones and the importance of develop- ing adequate comparative datasets for producing well- resolved classifications

    Clustering by compression

    Full text link
    We present a new method for clustering based on compression. The method doesn't use subject-specific features or background knowledge, and works as follows: First, we determine a universal similarity distance, the normalized compression distance or NCD, computed from the lengths of compressed data files (singly and in pairwise concatenation). Second, we apply a hierarchical clustering method. The NCD is universal in that it is not restricted to a specific application area, and works across application area boundaries. A theoretical precursor, the normalized information distance, co-developed by one of the authors, is provably optimal but uses the non-computable notion of Kolmogorov complexity. We propose precise notions of similarity metric, normal compressor, and show that the NCD based on a normal compressor is a similarity metric that approximates universality. To extract a hierarchy of clusters from the distance matrix, we determine a dendrogram (binary tree) by a new quartet method and a fast heuristic to implement it. The method is implemented and available as public software, and is robust under choice of different compressors. To substantiate our claims of universality and robustness, we report evidence of successful application in areas as diverse as genomics, virology, languages, literature, music, handwritten digits, astronomy, and combinations of objects from completely different domains, using statistical, dictionary, and block sorting compressors. In genomics we presented new evidence for major questions in Mammalian evolution, based on whole-mitochondrial genomic analysis: the Eutherian orders and the Marsupionta hypothesis against the Theria hypothesis.Comment: LaTeX, 27 pages, 20 figure
    corecore