1,095 research outputs found

    Novel Distances for Dollo Data

    We investigate distances on binary (presence/absence) data in the context of a Dollo process, where a trait can only arise once on a phylogenetic tree but may be lost many times. We introduce a novel distance, the Additive Dollo Distance (ADD), which is consistent for data generated under a Dollo model, and show that it has some useful theoretical properties, including an intriguing link to the LogDet distance. Simulations of Dollo data are used to compare a number of binary distances, including ADD, LogDet, Nei-Li and some simple, but to our knowledge previously unstudied, variations on common binary distances. The simulations suggest that ADD outperforms other distances on Dollo data. Interestingly, we found that the LogDet distance performs poorly in the context of a Dollo process, which may have implications for its use in connection with conditioned genome reconstruction. We apply the ADD to two Diversity Arrays Technology (DArT) datasets, one that broadly covers Eucalyptus species and one that focuses on the Eucalyptus series Adnataria. We also reanalyse gene family presence/absence data on bacteria from the COG database and compare the results to previous phylogenies estimated using the conditioned genome reconstruction approach.

    Systematic Analysis of Cluster Similarity Indices: How to Validate Validation Measures

    Many cluster similarity indices are used to evaluate clustering algorithms, and choosing the best one for a particular task remains an open problem. We demonstrate that this problem is crucial: there are many disagreements among the indices, these disagreements do affect which algorithms are preferred in applications, and this can lead to degraded performance in real-world systems. We propose a theoretical framework to tackle this problem: we develop a list of desirable properties and conduct an extensive theoretical analysis to verify which indices satisfy them. This allows for making an informed choice: given a particular application, one can first select properties that are desirable for the task and then identify indices satisfying these. Our work unifies and considerably extends existing attempts at analyzing cluster similarity indices: we introduce new properties, formalize existing ones, and mathematically prove or disprove each property for an extensive list of validation indices. This broader and more rigorous approach leads to recommendations that considerably differ from how validation indices are currently being chosen by practitioners. Some of the most popular indices are even shown to be dominated by previously overlooked ones.

    Decompositions of Triangle-Dense Graphs

    High triangle density -- the graph property stating that a constant fraction of two-hop paths belong to a triangle -- is a common signature of social networks. This paper studies triangle-dense graphs from a structural perspective. We prove constructively that significant portions of a triangle-dense graph are contained in a disjoint union of dense, radius-2 subgraphs. This result quantifies the extent to which triangle-dense graphs resemble unions of cliques. We also show that our algorithm recovers planted clusterings in approximation-stable k-median instances.
    Comment: 20 pages. Version 1->2: Minor edits. 2->3: Strengthened §3.5, removed appendi
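The definition of triangle density in this abstract (the fraction of two-hop paths that are closed into a triangle) can be illustrated with a small sketch. This is only an illustration of the definition, not the paper's decomposition algorithm; the function name `triangle_density` and the adjacency-dictionary input format are assumptions made for the example.

```python
from itertools import combinations


def triangle_density(adj):
    """Fraction of two-hop paths (wedges) that close into a triangle.

    `adj` is a hypothetical input format: a dict mapping each node to the
    set of its neighbours in an undirected simple graph.
    """
    wedges = 0
    closed = 0
    for center, neighbours in adj.items():
        # Every unordered pair of neighbours of `center` forms a two-hop path
        # u - center - v.
        for u, v in combinations(neighbours, 2):
            wedges += 1
            # The path belongs to a triangle if its endpoints are adjacent.
            if v in adj[u]:
                closed += 1
    return closed / wedges if wedges else 0.0


# A triangle a-b-c with one pendant node d attached to a.
example_graph = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}
density = triangle_density(example_graph)
```

In the example graph, five two-hop paths exist and three of them lie on the triangle, so the density is 3/5.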

    Group Analysis of Self-organizing Maps based on Functional MRI using Restricted Frechet Means

    Studies of functional MRI data are increasingly concerned with the estimation of differences in spatio-temporal networks across groups of subjects or experimental conditions. Unsupervised clustering and independent component analysis (ICA) have been used to identify such spatio-temporal networks. While these approaches have been useful for estimating these networks at the subject-level, comparisons over groups or experimental conditions require further methodological development. In this paper, we tackle this problem by showing how self-organizing maps (SOMs) can be compared within a Frechean inferential framework. Here, we summarize the mean SOM in each group as a Frechet mean with respect to a metric on the space of SOMs. We consider the use of different metrics, and introduce two extensions of the classical sum of minimum distance (SMD) between two SOMs, which take into account the spatio-temporal pattern of the fMRI data. The validity of these methods is illustrated on synthetic data. Through these simulations, we show that the three metrics of interest behave as expected, in the sense that the ones capturing temporal, spatial and spatio-temporal aspects of the SOMs are more likely to reach significance under simulated scenarios characterized by temporal, spatial and spatio-temporal differences, respectively. In addition, a re-analysis of a classical experiment on visually-triggered emotions demonstrates the usefulness of this methodology. In this study, the multivariate functional patterns typical of the subjects exposed to pleasant and unpleasant stimuli are found to be more similar than the ones of the subjects exposed to emotionally neutral stimuli. Taken together, these results indicate that our proposed methods can cast new light on existing data by adopting a global analytical perspective on functional MRI paradigms.
    Comment: 23 pages, 5 figures, 4 tables. Submitted to Neuroimag

    Measure based metrics for aggregated data

    Aggregated data arises commonly from surveys and censuses where groups of individuals are studied as coherent entities. The aggregated data can take many forms, including sets, intervals, distributions and histograms. The data analyst needs to measure the similarity between such aggregated data items, and the literature reports a range of metrics for this purpose (e.g. the Jaccard metric for sets and the Wasserstein metric for histograms). In this paper, a unifying theory based on measure theory is developed that not only establishes that known metrics are essentially similar but also suggests new metrics.
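The two metrics named as examples in this abstract can be sketched in a few lines. This is a minimal illustration of the standard definitions only, not the measure-theoretic construction the paper develops; the function names and the assumption that both histograms share the same equally spaced bins are choices made for the example.

```python
from itertools import accumulate


def jaccard_distance(a, b):
    """Jaccard distance between two sets: 1 - |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # convention: two empty sets are identical
    return 1.0 - len(a & b) / len(union)


def wasserstein_1d(p, q, bin_width=1.0):
    """1-Wasserstein distance between two histograms.

    Assumes (for this sketch) that `p` and `q` hold non-negative counts over
    the same equally spaced bins. In one dimension the distance equals the
    area between the two cumulative distribution functions.
    """
    if len(p) != len(q):
        raise ValueError("histograms must share the same bins")
    total_p, total_q = sum(p), sum(q)
    if total_p <= 0 or total_q <= 0:
        raise ValueError("histograms must have positive total mass")
    # Normalise counts to probabilities, then accumulate into CDFs.
    cdf_p = accumulate(x / total_p for x in p)
    cdf_q = accumulate(x / total_q for x in q)
    return bin_width * sum(abs(cp - cq) for cp, cq in zip(cdf_p, cdf_q))


set_distance = jaccard_distance({1, 2, 3}, {2, 3, 4})
hist_distance = wasserstein_1d([1, 0, 0], [0, 0, 1])
```

In the usage lines, the two sets share two of four distinct elements, and the two histograms place all their mass two bins apart.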