
    BOOL-AN: A method for comparative sequence analysis and phylogenetic reconstruction

    A novel discrete mathematical approach is proposed as an additional tool for molecular systematics that does not require prior statistical assumptions concerning the evolutionary process. The method is based on algorithms that generate mathematical representations directly from DNA/RNA or protein sequences and then output numerical (scalar or vector) and visual (graph) characteristics. The binary-encoded sequence information is transformed into a compact analytical form, called the Iterative Canonical Form (ICF) of Boolean functions, which can then be used as a generalized molecular descriptor. The method provides raw vector data for calculating different distance matrices, which in turn can be analyzed by neighbor-joining or UPGMA to derive a phylogenetic tree, or by principal coordinates analysis to obtain an ordination scattergram. The new method and the associated software for inferring phylogenetic trees are called Boolean analysis, or BOOL-AN.
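
    The last step of the pipeline described above (vectors, then a distance matrix, then a UPGMA tree) can be illustrated with a minimal sketch. It does not reproduce the ICF transformation or the BOOL-AN software: the binary vectors are made-up stand-ins for the method's descriptors, the normalized Hamming distance is only one assumed choice among the "different distance matrices" the abstract mentions, and the function names `hamming` and `upgma` are hypothetical.

```python
from itertools import combinations


def hamming(u, v):
    """Fraction of positions at which two equal-length vectors differ."""
    if len(u) != len(v):
        raise ValueError("vectors must have equal length")
    return sum(a != b for a, b in zip(u, v)) / len(u)


def upgma(names, vectors, dist=hamming):
    """Build a tree by UPGMA (average linkage); returns nested tuples of names."""
    # cluster id -> (subtree, number of leaves in the subtree)
    clusters = {i: (name, 1) for i, name in enumerate(names)}
    # pairwise distances, keyed by the unordered pair of cluster ids
    d = {frozenset((i, j)): dist(vectors[i], vectors[j])
         for i, j in combinations(range(len(names)), 2)}
    next_id = len(names)
    while len(clusters) > 1:
        # merge the two closest clusters
        pair = min((frozenset(p) for p in combinations(clusters, 2)),
                   key=lambda p: d[p])
        i, j = sorted(pair)
        (ti, ni), (tj, nj) = clusters.pop(i), clusters.pop(j)
        # distance to the merged cluster is the size-weighted average
        for k in clusters:
            d[frozenset((next_id, k))] = (
                ni * d[frozenset((i, k))] + nj * d[frozenset((j, k))]
            ) / (ni + nj)
        clusters[next_id] = ((ti, tj), ni + nj)
        next_id += 1
    (tree, _), = clusters.values()
    return tree


# Illustrative (invented) binary descriptors for four taxa.
names = ["A", "B", "C", "D"]
vectors = [
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 1],
    [1, 1, 1, 1, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 0, 0],
]
tree = upgma(names, vectors)
print(tree)
```

    With these toy vectors, A and B are merged first and C and D second, so the result groups the taxa into two pairs. Branch lengths are omitted to keep the sketch short.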

    Topology Discovery of Sparse Random Graphs With Few Participants

    We consider the task of topology discovery of sparse random graphs using end-to-end random measurements (e.g., delay) between a subset of nodes, referred to as the participants. The rest of the nodes are hidden, and do not provide any information for topology discovery. We consider topology discovery under two routing models: (a) the participants exchange messages along the shortest paths and obtain end-to-end measurements, and (b) additionally, the participants exchange messages along the second shortest path. For scenario (a), our proposed algorithm results in a sub-linear edit-distance guarantee using a sub-linear number of uniformly selected participants. For scenario (b), we obtain a much stronger result, and show that we can achieve consistent reconstruction when a sub-linear number of uniformly selected nodes participate. This implies that accurate discovery of sparse random graphs is tractable using an extremely small number of participants. We finally obtain a lower bound on the number of participants required by any algorithm to reconstruct the original random graph up to a given edit distance. We also demonstrate that while consistent discovery is tractable for sparse random graphs using a small number of participants, in general, there are graphs which cannot be discovered by any algorithm even with a significant number of participants, and with the availability of end-to-end information along all the paths between the participants. Comment: A shorter version appears in ACM SIGMETRICS 2011. This version is scheduled to appear in J. on Random Structures and Algorithm

    Cultural Phylogenetics of the Tupi Language Family in Lowland South America

    Background: Recent advances in automated assessment of basic vocabulary lists allow the construction of linguistic phylogenies useful for tracing dynamics of human population expansions, reconstructing ancestral cultures, and modeling transition rates of cultural traits over time. Methods: Here we investigate the expansion of Tupi, a widely dispersed language family in lowland South America, with a distance-based phylogeny built from 40-word vocabulary lists from 48 languages. We coded 11 cultural traits across the diverse Tupi family, including traditional warfare patterns, post-marital residence, corporate structure, community size, paternity beliefs, sibling terminology, presence of canoes, tattooing, shamanism, men’s houses, and lip plugs. Results/Discussion: The linguistic phylogeny supports a Tupi homeland in west-central Brazil with subsequent major expansions across much of lowland South America. Consistently, ancestral reconstructions of cultural traits over the linguistic phylogeny suggest that social complexity has tended to decline through time, most notably in the independent emergence of several nomadic hunter-gatherer societies. Estimated rates of cultural change across the Tupi expansion are on the order of only a few changes per 10,000 years, in accord with previous cultural phylogenetic results in other languag

    Evolutionary dynamics of structural features

    Structural features have the potential to push back the time barrier beyond which we cannot test hypotheses about the relatedness of languages. However, we have to know the stability of structural features in order to apply them for such purposes. In this thesis I describe the typological profile of the Transeurasian languages, which serve as a data sample for the analysis of stability; build a phylogenetic tree with these languages; measure the stability of structural features as phylogenetic signal and evolutionary rate; reconstruct ancestral states of structural features; and apply an admixture model from population genetics to test the performance of phonological, morphological and syntactic features in assigning languages to their respective language families and to investigate the level of diffusion in these three feature sets. More than half of the structural features appear to have a high phylogenetic signal and evolve at a slow rate. I compare the stability across functional categories, parts of speech and language levels and conclude that argument marking (flagging and indexing), derivation and valency are the most stable functional categories, pronouns and nouns the most stable parts of speech, and phonology and morphology the most stable language levels. The admixture model as implemented in STRUCTURE is able to correctly identify the Turkic, Mongolic and Tungusic language families at the levels of morphology and syntax, whereas the Japonic and Koreanic languages are assigned to the same ancestry. We see the least admixture at the level of morphology and the most admixture in syntactic features. One of the most important insights is that morphological features carry the most genealogical information, and these features could be used in the future to test relationships above the language family level.

    A Fast-Graph Approach to Modeling Similarity of Whole Genomes

    As increasing numbers of closely related genomic sequences become available, the need for methods that detect fine differences among them also grows apparent. Several calls have been made for improved algorithms to exploit the wealth of pathogenic viral and bacterial sequence data that are rapidly becoming available to researchers. The first stage of our research addresses the computational limitations associated with whole-genome comparisons of large numbers of subspecies sequences. We investigate the potential of fast, word-based comparative measures to approximate computationally expensive, full-alignment comparison methods. Recent advances in next-generation sequencing are providing a number of large whole-genome sequence datasets stemming from globally distributed disease occurrences. This offers an unprecedented opportunity for epidemiological studies and for the development of computationally efficient, robust tools for such studies. In the second stage of our research, we present an approach that enables a quick, effective, and robust epidemiological analysis of large whole-genome datasets. We then apply our method to a complex dataset of over 4,200 globally sampled Influenza A virus isolates from multiple host types, subtypes and years. These sequences are compared using an alignment-free method that runs in linear time. These comparisons enable us to build two-dimensional graphs that represent the relationships between sequences, where sequences are viewed as vertices and high-degree sequence similarity as edges. These graphs prove useful, as they are able to model potential disease transmission paths when applied to viral sequences. Mixing patterns are then used to study the occurrence and patterns of edges between different types of sequence groups, such as host type and year of collection, to better understand the potential for genotypic transfer between sequence groups.
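
    The graph construction and mixing-pattern idea described above can be sketched roughly as follows. The abstract does not name its word-based measure, so the Jaccard similarity over k-mer sets used here is an assumption, as are the word length `k`, the `threshold`, the toy sequences, the host labels, and the function names. The sketch compares all pairs of sequences, so it illustrates the idea on a small input and is not the authors' scalable implementation.

```python
from collections import Counter
from itertools import combinations


def kmers(seq, k):
    """Set of all overlapping words of length k in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity_graph(seqs, k=3, threshold=0.5):
    """Edges between sequences whose k-mer sets are sufficiently similar.

    seqs maps a sequence name (vertex) to its string; k and threshold
    are illustrative values, not taken from the source.
    """
    profiles = {name: kmers(s, k) for name, s in seqs.items()}
    return [(u, v) for u, v in combinations(sorted(profiles), 2)
            if jaccard(profiles[u], profiles[v]) >= threshold]


def mixing_counts(edges, group):
    """Count edges by the (unordered) pair of group labels they connect."""
    return Counter(tuple(sorted((group[u], group[v]))) for u, v in edges)


# Invented toy data: three short sequences with hypothetical host labels.
seqs = {"s1": "ACGTACGT", "s2": "ACGTACGA", "s3": "TTTTGGGG"}
hosts = {"s1": "avian", "s2": "human", "s3": "swine"}

edges = similarity_graph(seqs)
print(edges)
print(mixing_counts(edges, hosts))
```

    Here only s1 and s2 share enough 3-mers to be linked, so the single edge connects an "avian" and a "human" vertex. Tabulating such edge counts by host type or collection year is the kind of summary the mixing patterns above refer to.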