Novel Distances for Dollo Data
We investigate distances on binary (presence/absence) data in the context of
a Dollo process, where a trait can only arise once on a phylogenetic tree but
may be lost many times. We introduce a novel distance, the Additive Dollo
Distance (ADD), which is consistent for data generated under a Dollo model, and
show that it has some useful theoretical properties including an intriguing
link to the LogDet distance. Simulations of Dollo data are used to compare a
number of binary distances including ADD, LogDet, Nei-Li and some simple, but
to our knowledge previously unstudied, variations on common binary distances.
The simulations suggest that ADD outperforms other distances on Dollo data.
Interestingly, we found that the LogDet distance performs poorly in the context
of a Dollo process, which may have implications for its use in connection with
conditioned genome reconstruction. We apply the ADD to two Diversity Arrays
Technology (DArT) datasets, one that broadly covers Eucalyptus species and one
that focuses on the Eucalyptus series Adnataria. We also reanalyse gene family
presence/absence data on bacteria from the COG database and compare the results
to previous phylogenies estimated using the conditioned genome reconstruction
approach.
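As an illustration of the kind of binary distance compared in the abstract above, the following minimal sketch computes a Nei-Li (Dice-type) distance between two presence/absence vectors. It is not the ADD, whose formula is not given here. The function names, the toy vectors and the choice of `1 - similarity` as the distance are assumptions made for this example.

```python
def nei_li_similarity(x, y):
    """Nei-Li (Dice) similarity between two 0/1 presence/absence vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    # Traits present in both taxa; shared absences are ignored.
    shared = sum(1 for a, b in zip(x, y) if a and b)
    total = sum(x) + sum(y)
    if total == 0:
        raise ValueError("similarity is undefined when both taxa lack every trait")
    return 2.0 * shared / total


def nei_li_distance(x, y):
    """One common way to turn the similarity into a distance (an assumption here)."""
    return 1.0 - nei_li_similarity(x, y)


# Hypothetical presence/absence profiles for two taxa over six traits.
taxon_a = [1, 1, 0, 1, 0, 1]
taxon_b = [1, 0, 0, 1, 1, 1]
print(nei_li_distance(taxon_a, taxon_b))  # 0.25
```

Because the measure ignores shared absences, two taxa that have both lost a trait are not treated as more similar for that reason.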
Systematic Analysis of Cluster Similarity Indices: How to Validate Validation Measures
Many cluster similarity indices are used to evaluate clustering algorithms,
and choosing the best one for a particular task remains an open problem. We
demonstrate that this problem is crucial: there are many disagreements among
the indices, these disagreements do affect which algorithms are preferred in
applications, and this can lead to degraded performance in real-world systems.
We propose a theoretical framework to tackle this problem: we develop a list of
desirable properties and conduct an extensive theoretical analysis to verify
which indices satisfy them. This allows for making an informed choice: given a
particular application, one can first select properties that are desirable for
the task and then identify indices satisfying these. Our work unifies and
considerably extends existing attempts at analyzing cluster similarity indices:
we introduce new properties, formalize existing ones, and mathematically prove
or disprove each property for an extensive list of validation indices. This
broader and more rigorous approach leads to recommendations that considerably
differ from how validation indices are currently being chosen by practitioners.
Some of the most popular indices are even shown to be dominated by previously
overlooked ones.
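The disagreement between indices described above can be seen in a small pair-counting sketch. It compares two standard indices, the Rand index and the Jaccard index, on a toy example; the labelings and function names are assumptions chosen for illustration and are not taken from the paper.

```python
from itertools import combinations


def pair_counts(labels_a, labels_b):
    """Count element pairs by whether each clustering puts them together."""
    if len(labels_a) != len(labels_b):
        raise ValueError("labelings must cover the same elements")
    n11 = n10 = n01 = n00 = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        if same_a and same_b:
            n11 += 1
        elif same_a:
            n10 += 1
        elif same_b:
            n01 += 1
        else:
            n00 += 1
    return n11, n10, n01, n00


def rand_index(labels_a, labels_b):
    """Fraction of pairs on which the two clusterings agree."""
    n11, n10, n01, n00 = pair_counts(labels_a, labels_b)
    return (n11 + n00) / (n11 + n10 + n01 + n00)


def jaccard_index(labels_a, labels_b):
    """Pairs joined in both clusterings over pairs joined in at least one."""
    n11, n10, n01, _ = pair_counts(labels_a, labels_b)
    return n11 / (n11 + n10 + n01)


reference = [0, 0, 0, 1, 1, 1]    # hypothetical ground truth
singletons = [0, 1, 2, 3, 4, 5]   # every element in its own cluster
one_cluster = [0, 0, 0, 0, 0, 0]  # everything merged

print(rand_index(reference, singletons), rand_index(reference, one_cluster))        # 0.6 0.4
print(jaccard_index(reference, singletons), jaccard_index(reference, one_cluster))  # 0.0 0.4
```

On this toy input the Rand index ranks the all-singletons clustering higher, while the Jaccard index ranks the single merged cluster higher, so the choice of index changes which candidate is preferred.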
Decompositions of Triangle-Dense Graphs
High triangle density -- the graph property stating that a constant fraction
of two-hop paths belong to a triangle -- is a common signature of social
networks. This paper studies triangle-dense graphs from a structural
perspective. We prove constructively that significant portions of a
triangle-dense graph are contained in a disjoint union of dense, radius 2
subgraphs. This result quantifies the extent to which triangle-dense graphs
resemble unions of cliques. We also show that our algorithm recovers planted
clusterings in approximation-stable k-median instances.
Comment: 20 pages. Version 1->2: Minor edits. 2->3: Strengthened §3.5,
removed appendi
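The triangle density defined in the abstract above, the fraction of two-hop paths that belong to a triangle, can be computed directly for a small graph. The sketch below assumes an undirected simple graph given as an edge list; the function name and the example graph are illustrative and not from the paper, and this is not the decomposition algorithm itself.

```python
from itertools import combinations


def triangle_density(edges):
    """Fraction of two-hop paths u-v-w whose endpoints u and w are adjacent."""
    adjacency = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)

    two_hop_paths = 0
    closed_paths = 0
    for neighbours in adjacency.values():
        # Each unordered pair of neighbours of a vertex is one two-hop path
        # centred at that vertex.
        for u, w in combinations(neighbours, 2):
            two_hop_paths += 1
            if w in adjacency[u]:
                closed_paths += 1
    return closed_paths / two_hop_paths if two_hop_paths else 0.0


# A triangle (0, 1, 2) with one pendant vertex 3 attached to vertex 2.
example_edges = [(0, 1), (1, 2), (0, 2), (2, 3)]
print(triangle_density(example_edges))  # 0.6
```

A clique scores 1.0 under this measure and a tree scores 0.0, which matches the intuition that triangle-dense graphs resemble unions of cliques.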
Group Analysis of Self-organizing Maps based on Functional MRI using Restricted Frechet Means
Studies of functional MRI data are increasingly concerned with the estimation
of differences in spatio-temporal networks across groups of subjects or
experimental conditions. Unsupervised clustering and independent component
analysis (ICA) have been used to identify such spatio-temporal networks. While
these approaches have been useful for estimating these networks at the
subject-level, comparisons over groups or experimental conditions require
further methodological development. In this paper, we tackle this problem by
showing how self-organizing maps (SOMs) can be compared within a Frechean
inferential framework. Here, we summarize the mean SOM in each group as a
Frechet mean with respect to a metric on the space of SOMs. We consider the use
of different metrics, and introduce two extensions of the classical sum of
minimum distance (SMD) between two SOMs, which take into account the
spatio-temporal pattern of the fMRI data. The validity of these methods is
illustrated on synthetic data. Through these simulations, we show that the
three metrics of interest behave as expected, in the sense that the ones
capturing temporal, spatial and spatio-temporal aspects of the SOMs are more
likely to reach significance under simulated scenarios characterized by
temporal, spatial and spatio-temporal differences, respectively. In addition, a
re-analysis of a classical experiment on visually-triggered emotions
demonstrates the usefulness of this methodology. In this study, the
multivariate functional patterns typical of the subjects exposed to pleasant
and unpleasant stimuli are found to be more similar than the ones of the
subjects exposed to emotionally neutral stimuli. Taken together, these results
indicate that our proposed methods can cast new light on existing data by
adopting a global analytical perspective on functional MRI paradigms.
Comment: 23 pages, 5 figures, 4 tables. Submitted to Neuroimag
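For orientation, the classical sum of minimum distance (SMD) mentioned in the abstract above can be sketched as follows. The sketch treats each SOM as a plain list of prototype vectors and uses Euclidean distance with a factor of one half; both the representation and the normalization are assumptions for this example, and the paper's two spatio-temporal extensions are not reproduced.

```python
import math


def sum_of_minimum_distances(prototypes_a, prototypes_b):
    """Classical SMD between two non-empty sets of prototype vectors.

    Each prototype is matched to its nearest prototype in the other set,
    the matched distances are summed in both directions, and the two sums
    are averaged so that the result is symmetric.
    """
    if not prototypes_a or not prototypes_b:
        raise ValueError("both prototype sets must be non-empty")
    a_to_b = sum(min(math.dist(a, b) for b in prototypes_b) for a in prototypes_a)
    b_to_a = sum(min(math.dist(a, b) for a in prototypes_a) for b in prototypes_b)
    return 0.5 * (a_to_b + b_to_a)


# Two hypothetical SOMs, each with two prototypes in a 2-D feature space.
som_a = [(0.0, 0.0), (1.0, 0.0)]
som_b = [(0.0, 0.0), (3.0, 0.0)]
print(sum_of_minimum_distances(som_a, som_b))  # 1.5
```

`math.dist` requires Python 3.8 or later.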
Measure based metrics for aggregated data
Aggregated data commonly arise from surveys and censuses in which groups of
individuals are studied as coherent entities. Such data can take many forms,
including sets, intervals, distributions and histograms. The data analyst needs
to measure the similarity between aggregated data items, and a range of metrics
has been reported in the literature for this purpose (e.g. the Jaccard metric
for sets and the Wasserstein metric for histograms). In this paper, a unifying
theory based on measure theory is developed that not only establishes that
known metrics are essentially similar but also suggests new metrics.
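The two metrics named as examples above can be sketched in a few lines. The Jaccard distance operates on sets; the Wasserstein distance is shown in its one-dimensional form for two histograms that share equally spaced bins. The function names, the `bin_width` parameter and the toy inputs are assumptions for this example, and the sketch does not implement the unifying measure-theoretic framework.

```python
def jaccard_distance(set_a, set_b):
    """1 - |A intersect B| / |A union B|; two empty sets are at distance 0."""
    set_a, set_b = set(set_a), set(set_b)
    union = set_a | set_b
    if not union:
        return 0.0
    return 1.0 - len(set_a & set_b) / len(union)


def wasserstein_1d(hist_p, hist_q, bin_width=1.0):
    """1-Wasserstein distance between two histograms on the same bins.

    Both histograms are normalized to unit mass; the distance is the area
    between their cumulative distributions.
    """
    if len(hist_p) != len(hist_q):
        raise ValueError("histograms must share the same bins")
    total_p, total_q = sum(hist_p), sum(hist_q)
    if total_p <= 0 or total_q <= 0:
        raise ValueError("histograms must have positive total mass")
    cdf_gap = 0.0
    distance = 0.0
    for p, q in zip(hist_p, hist_q):
        cdf_gap += p / total_p - q / total_q
        distance += abs(cdf_gap) * bin_width
    return distance


print(jaccard_distance({1, 2, 3}, {2, 3, 4}))  # 0.5
print(wasserstein_1d([1, 0, 0], [0, 0, 1]))    # 2.0
```

In the second call all mass moves two bins to the right, so the distance equals two bin widths.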
- …