12 research outputs found

    Ultrametric Component Analysis with Application to Analysis of Text and of Emotion

    We review the theory and practice of determining what parts of a data set are ultrametric. It is assumed that the data set, to begin with, is endowed with a metric, and we include discussion of how this can be brought about if only a dissimilarity holds. The basis for part of the metric-endowed data set being ultrametric is to consider triplets of the observables (vectors). We develop a novel consensus of hierarchical clusterings. We do this in order to have a framework (including visualization and supporting interpretation) for the parts of the data that are determined to be ultrametric. Furthermore, a major objective is to determine locally ultrametric relationships as opposed to non-local ultrametric relationships. As part of this work, we also study a particular property of our ultrametricity coefficient, namely that it is a function of the difference between the base angles of the isosceles triangle. This work is completed by a review of related work on consensus hierarchies, and of a major new application, namely quantifying and interpreting the emotional content of narrative. Comment: 49 pages, 15 figures, 52 citations
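The triplet idea in this abstract can be illustrated with a minimal sketch. It assumes a plain distance-based test (the two largest pairwise distances in a triplet are equal, so the triangle is isosceles with a small base), not the angle-based ultrametricity coefficient the authors study. The function names and the tolerance are hypothetical choices made for illustration.

```python
from itertools import combinations


def is_ultrametric_triplet(dxy, dxz, dyz, rel_tol=1e-9):
    """Return True if the two largest of the three distances are (nearly) equal.

    This is equivalent to the ultrametric inequality
    d(x, z) <= max(d(x, y), d(y, z)) holding for every labelling of the triplet.
    """
    _, mid, large = sorted((dxy, dxz, dyz))
    return large - mid <= rel_tol * max(large, 1.0)


def ultrametric_fraction(dist):
    """Fraction of triplets in a symmetric distance matrix that pass the test."""
    n = len(dist)
    triplets = list(combinations(range(n), 3))
    if not triplets:
        return 1.0  # fewer than three points: nothing can violate the condition
    hits = sum(
        is_ultrametric_triplet(dist[i][j], dist[i][k], dist[j][k])
        for i, j, k in triplets
    )
    return hits / len(triplets)
```

A fraction near 1 suggests the data are close to ultrametric overall. Restricting the triplets to nearby points would be one way to probe local rather than global ultrametricity.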

    Studying Evolutionary Change: Transdisciplinary Advances in Understanding and Measuring Evolution

    Evolutionary processes can be found in almost any historical, i.e. evolving, system that erroneously copies from the past. Well-studied examples originate not only in evolutionary biology but also in historical linguistics. Yet an approach that would bind together studies of such evolving systems is still elusive. This thesis is an attempt to narrow this gap to some extent. An evolving system can be described using characters that identify its changing features. While the problem of a proper choice of characters is beyond the scope of this thesis and remains in the hands of experts, we concern ourselves with some theoretical as well as data-driven approaches. Having a well-chosen set of characters describing a system of different entities such as homologous genes, i.e. genes of the same origin in different species, we can build a phylogenetic tree. Consider the special case of gene clusters containing paralogous genes, i.e. genes of the same origin within a species, usually located close together, such as the well-known HOX cluster. These are formed by stepwise duplication of their members, often involving unequal crossing over that forms hybrid genes. Gene conversion and possibly other mechanisms of concerted evolution further obfuscate phylogenetic relationships. Hence, it is very difficult or even impossible to disentangle the detailed history of gene duplications in gene clusters. The expansion of gene clusters by unequal crossing over, as proposed by Walter Gehring, leads to distinctive patterns of genetic distances. We show that this special class of distances still helps in extracting phylogenetic information from the data. Disregarding genome rearrangements, we find that the shortest Hamiltonian path then coincides with the ordering of paralogous genes in a cluster.
This observation can be used to detect ancient genomic rearrangements of gene clusters and to distinguish gene clusters whose evolution was dominated by unequal crossing over within genes from those that expanded through other mechanisms. While the evolution of DNA or protein sequences is well studied and can be formally described, we find that this does not hold for other systems such as language evolution. This is due to a lack of detectable mechanisms that drive the evolutionary processes in other fields. Hence, it is hard to quantify distances between entities, e.g. languages, and therefore the characters describing them. Starting out with distortions of distances, we first see that poor choices of the distance measure can lead to incorrect phylogenies. Given that phylogenetic inference requires additive metrics, we can infer the correct phylogeny from a distance matrix D if there is a monotonic, subadditive function ζ such that ζ^−1(D) is additive. We compute the metric-preserving transformation ζ as the solution of an optimization problem. This result shows that the problem of phylogeny reconstruction is well defined even if a detailed mechanistic model of the evolutionary process is missing. Yet, this does not hinder studies of language evolution using automated tools. As the number of large digital corpora available has increased, so have the possibilities to study them automatically. The obvious parallels between historical linguistics and phylogenetics have led to many studies adapting bioinformatics tools to fit linguistic needs. Here, we use jAlign to calculate bigram alignments, i.e. alignments produced by an algorithm that takes the adjacency of letters into account. Its performance is tested in different cognate recognition tasks. One major obstacle with pairwise alignments is the systematic errors they make, such as the underestimation of gaps and their misplacement.
Applying multiple sequence alignments instead of a pairwise algorithm implicitly includes more evolutionary information and thus can overcome the problem of correct gap placement. They can be seen as a generalization of the string-to-string edit problem to more than two strings. With the steady increase in computational power, exact dynamic programming solutions have also become feasible in practice for 3- and 4-way alignments. For the pairwise (2-way) case, there is a clear distinction between local and global alignments. As more sequences are considered, this distinction, which can in fact be made independently for both ends of each sequence, gives rise to a rich set of partially local alignment problems. So far these have remained largely unexplored. Thus, a general formal framework that gives rise to a classification of partially local alignment problems is introduced. It leads to a generic scheme that guides the principled design of exact dynamic programming solutions for particular partially local alignment problems.
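The abstract above relies on distance matrices being additive (tree-like). As a minimal sketch, additivity can be tested with the standard four-point condition: for every four points, the two largest of the three pairwise sums must be equal. The function name and tolerance are hypothetical, and the sketch does not reproduce the thesis's optimization for the transformation ζ.

```python
from itertools import combinations


def satisfies_four_point(dist, tol=1e-9):
    """Check the four-point condition on a symmetric distance matrix.

    For each quadruple (w, x, y, z) the three sums
    d(w,x)+d(y,z), d(w,y)+d(x,z), d(w,z)+d(x,y)
    are sorted, and the two largest must agree within `tol`.
    """
    n = len(dist)
    for w, x, y, z in combinations(range(n), 4):
        sums = sorted((
            dist[w][x] + dist[y][z],
            dist[w][y] + dist[x][z],
            dist[w][z] + dist[x][y],
        ))
        if sums[2] - sums[1] > tol:
            return False
    return True
```

One could apply a candidate inverse transformation elementwise to D and re-run this check to see whether the transformed matrix is additive.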

    Un panorama des approximations en norme du supremum pour la classification

    In a general framework where the concepts of subdominant/updominated play a crucial role, we give a broad overview of approximations in the supremum norm for many structures used in classification: (partial or nonpartial) ultrametrics, k-ultrametrics, and convex and isotonic regressions. For tree semi-distances/dissimilarities and Robinsonian dissimilarities, we show how the general approach leads to constant-factor algorithms.

    The matroid structure of representative triple sets and triple-closure computation

    The closure cl(R) of a consistent set R of triples (rooted binary trees on three leaves) provides essential information about tree-like relations that are shown by any supertree that displays all triples in R. In this contribution, we are concerned with representative triple sets, that is, subsets R' of R with cl(R') = cl(R). In this case, R' still contains all information on the tree structure implied by R, although R' might be significantly smaller. We show that representative triple sets that are minimal w.r.t. inclusion form the basis of a matroid. This in turn implies that minimal representative triple sets also have minimum cardinality. In particular, the matroid structure can be used to show that minimum representative triple sets can be computed in polynomial time with a simple greedy approach. For a given triple set R that “identifies” a tree, we provide an exact value for the cardinality of its minimum representative triple sets. In addition, we utilize the latter results to provide a novel and efficient method to compute the closure cl(R) of a consistent triple set R that improves the time complexity O(|R| |L_R|^4) of the currently fastest known method proposed by Bryant and Steel (1995). In particular, if a minimum representative triple set for R is given, it can be shown that the time complexity to compute cl(R) can be improved by a factor up to |R| |L_R|. As it turns out, collections of quartets (unrooted binary trees on four leaves) do not provide a matroid structure, in general.

    New Algorithms and Methodology for Analysing Distances

    Distances arise in a wide variety of different contexts, one of which is partitional clustering, that is, the problem of finding groups of similar objects within a set of objects. These groups are seemingly very easy to find for humans, but very difficult to find for machines, as there are two major difficulties to be overcome: the first is defining an objective criterion for the vague notion of “groups of similar objects”, and the second is the computational complexity of finding such groups given a criterion. In the first part of this thesis, we focus on the first difficulty and show that even seemingly similar optimisation criteria used for partitional clustering can produce vastly different results. In the process of showing this we develop a new metric for comparing clustering solutions called the assignment metric. We then prove some new NP-completeness results for problems using two related “sum-of-squares” clustering criteria. Closely related to partitional clustering is the problem of hierarchical clustering. We extend and formalise this problem to the problem of constructing rooted edge-weighted X-trees, that is, trees with a leaf set X. It is well known that an X-tree can be uniquely reconstructed from a distance on X if the distance is an ultrametric. But in practice the complete distance on X may not always be available. In the second part of this thesis we look at some of the circumstances under which a tree can be uniquely reconstructed from incomplete distance information. We use a concept called a lasso and give some theoretical properties of a special type of lasso. We then develop an algorithm which can construct a tree together with a lasso from partial distance information and show how this can be applied to various incomplete datasets.

    Clustering-Based Robot Navigation and Control

    In robotics, it is essential to model and understand the topologies of configuration spaces in order to design provably correct motion planners. The common practice in motion planning for modelling configuration spaces requires either a global, explicit representation of a configuration space in terms of standard geometric and topological models, or an asymptotically dense collection of sample configurations connected by simple paths, capturing the connectivity of the underlying space. This dissertation introduces the use of clustering for closing the gap between these two complementary approaches. Traditionally an unsupervised learning method, clustering offers automated tools to discover hidden intrinsic structures in generally complex-shaped and high-dimensional configuration spaces of robotic systems. We demonstrate some potential applications of such clustering tools to the problem of feedback motion planning and control. The first part of the dissertation presents the use of hierarchical clustering for relaxed, deterministic coordination and control of multiple robots. We reinterpret this classical method for unsupervised learning as an abstract formalism for identifying and representing spatially cohesive and segregated robot groups at different resolutions, by relating the continuous space of configurations to the combinatorial space of trees. Based on this new abstraction and a careful topological characterization of the associated hierarchical structure, a provably correct, computationally efficient hierarchical navigation framework is proposed for collision-free coordinated motion design towards a designated multirobot configuration via a sequence of hierarchy-preserving local controllers. 
The second part of the dissertation introduces a new, robot-centric application of Voronoi diagrams to identify a collision-free neighborhood of a robot configuration that captures the local geometric structure of a configuration space around the robot’s instantaneous position. Based on robot-centric Voronoi diagrams, a provably correct, collision-free coverage and congestion control algorithm is proposed for distributed mobile sensing applications of heterogeneous disk-shaped robots; and a sensor-based reactive navigation algorithm is proposed for exact navigation of a disk-shaped robot in forest-like cluttered environments. These results strongly suggest that clustering is, indeed, an effective approach for automatically extracting intrinsic structures in configuration spaces and that it might play a key role in the design of computationally efficient, provably correct motion planners in complex, high-dimensional configuration spaces

    Gene Family Histories: Theory and Algorithms

    Detailed gene family histories and reconciliations with species trees are a prerequisite for studying associations between genetic and phenotypic innovations. Even though the true evolutionary scenarios are usually unknown, they impose certain constraints on the mathematical structure of data obtained from simple yes/no questions in pairwise comparisons of gene sequences. Recent advances in this field have led to the development of methods for reconstructing (aspects of) the scenarios on the basis of such relation data, which can most naturally be represented by graphs on the set of considered genes. We provide here novel characterizations of best match graphs (BMGs), which capture the notion of (reciprocal) best hits based on sequence similarities. BMGs provide the basis for the detection of orthologous genes (genes that diverged after a speciation event). There are two main sources of error in pipelines for orthology inference based on BMGs. Firstly, measurement errors in the estimation of best matches from sequence similarity in general lead to violations of the characteristic properties of BMGs. The second issue concerns the reconstruction of the orthology relation from a BMG. We show how to correct estimated BMGs to mathematically valid ones and how much information about orthologs is contained in BMGs. We then discuss implicit methods for horizontal gene transfer (HGT) inference that focus on pairs of genes that have diverged only after the divergence of the two species in which the genes reside. This situation defines the edge set of an undirected graph, the later-divergence-time (LDT) graph. We explore the mathematical structure of LDT graphs and show how much information about all HGT events is contained in such LDT graphs.

    Mitochondrial DNA diversity and origin of human communities from 4th-11th century Britain.

    Neither the archaeological nor the historical data have yet allowed a full understanding of the nature of the Germanic settlement in England. Analysis of the genetic structure of past history has mostly been carried out by inference from extant populations. However, genetic flow through migration over time is likely to have altered the genetic composition of modern samples. Analysis of the genetic composition of ancient populations (provided the authenticity of their DNA is established) gives direct insight into the past. Thus, mitochondrial DNA from pre-Saxon (4th century), early Saxon (5th-7th century) and late Saxon (9th-11th century) settlements has been analysed to obtain a better understanding of the population history of Britain. A methodology has been optimised by which ancient DNA from 1,000-1,800-year-old archaeological material was extracted and ~200-bp fragments of HVS-1 were amplified and sequenced. Rigorous controls for work in human ancient DNA were undertaken to prevent and recognise contamination. Established authenticity criteria were followed, including expected ancient DNA behaviour, internal replication of sequences and confirmation by independent labs. The sample size obtained has enabled a population-level study of communities of ancient Britain. In addition, an extensive database of >6500 mitochondrial DNA sequences was compiled for comparisons. Several estimates of haplotype and nucleotide genetic diversity were computed for modern and ancient populations. Counter-intuitively, the modern population of England, encompassing all successive waves of migration to the island, has a lower diversity than the ancient population, suggesting that diversity has been lost over the last millennium. In addition, mtDNA genetic continuity between ancient and modern England seems to have been interrupted.
Founder analyses of early (5th-7th century) and late (9th-11th century) periods indicate that, whereas the late period seems to have had Viking genetic influences, the early period has no close relationship with Germanic populations. Instead, the females of the early Anglo-Saxon period seem to represent the native British population. The female contribution of the Anglo-Saxon invasion would therefore have been minor, at least at that time and at these sites. The close genetic affinity between the ancient British population and the northernmost populations of Europe suggests they might have shared a common past during prehistory. It is proposed that, after post-glacial times, inhabitants of areas now submerged expanded to northern territories. The early settlements analysed reflect that very early expansion. Some time since then, a reduction in diversity seems to have occurred (possibly due to variation in family size after repeated epidemics), leading to the present-day mtDNA composition of England.

    On The Hardness of Inferring Phylogenies from Triplet-Dissimilarities

    This work considers the problem of reconstructing a phylogenetic tree from triplet dissimilarities, which are dissimilarities defined over taxon triplets. Triplet dissimilarities are possibly the simplest generalization of pairwise dissimilarities, and were used for phylogenetic reconstructions in the past few years. We study the hardness of finding a tree best fitting a given triplet-dissimilarity table under the ℓ∞ norm. We show that the corresponding decision problem is NP-hard and that the corresponding optimization problem cannot be approximated in polynomial time within a constant multiplicative factor smaller than 1.4. On the positive side, we present a polynomial-time constant-rate approximation algorithm for this problem. We also address the issue of best fit under maximal distortion, which corresponds to the largest ratio between matching entries in two triplet-dissimilarity tables. We show that it is NP-hard to approximate the corresponding optimization problem within any constant multiplicative factor.