Degenerating families of dendrograms
Dendrograms used in data analysis are ultrametric spaces, hence objects of
nonarchimedean geometry. It is known that there exist p-adic representations
of dendrograms. Completed by a point at infinity, they can be viewed as
subtrees of the Bruhat-Tits tree associated to the p-adic projective line.
The implications are that certain moduli spaces known in algebraic geometry are
p-adic parameter spaces of (families of) dendrograms, and stochastic
classification can also be handled within this framework. At the end, we
calculate the topology of the hidden part of a dendrogram.
Comment: 13 pages, 8 figures
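The link between ultrametric spaces and p-adic numbers mentioned in this abstract can be illustrated with a minimal sketch. It is not taken from the paper: the helper names (`p_adic_valuation`, `p_adic_distance`, `is_ultrametric`) and the sample points are assumptions chosen for illustration. The sketch computes the p-adic distance between integers and checks the strong triangle inequality d(x, z) <= max(d(x, y), d(y, z)) that defines an ultrametric.

```python
from fractions import Fraction
from itertools import permutations


def p_adic_valuation(n, p):
    """Largest k such that p**k divides n (n must be a nonzero integer)."""
    k = 0
    while n % p == 0:
        n //= p
        k += 1
    return k


def p_adic_distance(x, y, p):
    """p-adic distance |x - y|_p = p**(-v_p(x - y)), as an exact fraction."""
    if x == y:
        return Fraction(0)
    return Fraction(1, p ** p_adic_valuation(abs(x - y), p))


def is_ultrametric(points, dist):
    """Check d(x, z) <= max(d(x, y), d(y, z)) for all triples of distinct points."""
    return all(
        dist(x, z) <= max(dist(x, y), dist(y, z))
        for x, y, z in permutations(points, 3)
    )


# Hypothetical sample points; any finite set of integers would do.
points = [0, 1, 2, 3, 4, 8]
two_adic = lambda x, y: p_adic_distance(x, y, 2)
euclidean = lambda x, y: abs(x - y)

print(is_ultrametric(points, two_adic))   # the 2-adic distance passes the check
print(is_ultrametric(points, euclidean))  # the ordinary distance does not
```

In the 2-adic distance, 0 and 8 are close (their difference is divisible by 8), which is the kind of nested, tree-like structure a dendrogram encodes.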
Nonparametric Feature Extraction from Dendrograms
We propose feature extraction from dendrograms in a nonparametric way.
Minimax distance measures correspond to building a dendrogram with the single
linkage criterion and defining specific forms of a level function and a
distance function over it. We therefore extend this method to arbitrary
dendrograms. We develop a generalized framework wherein different distance
measures can be inferred from different types of dendrograms, level functions
and distance functions. Via an appropriate embedding, we compute a vector-based
representation of the inferred distances, in order to enable many numerical
machine learning algorithms to employ such distances. Then, to address the
model selection problem, we study the aggregation of different dendrogram-based
distances respectively in solution space and in representation space in the
spirit of deep representations. In the first approach, for example for the
clustering problem, we build a graph with positive and negative edge weights
according to the consistency of the clustering labels of different objects
among different solutions, in the context of ensemble methods. Then, we use an
efficient variant of correlation clustering to produce the final clusters. In
the second approach, we investigate the sequential combination of different
distances and features in the spirit of multi-layered
architectures to obtain the final features. Finally, we demonstrate the
effectiveness of our approach via several numerical studies.
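The correspondence between Minimax distances and single-linkage dendrograms stated in this abstract can be checked on a small example. The sketch below is an illustration under stated assumptions, not the authors' implementation: the function names and the four sample points are hypothetical, and the paper's generalized level and distance functions are not reproduced. It computes Minimax distances (the smallest achievable largest edge over all paths between two points) with a Floyd-Warshall-style update, and compares them with the merge heights of a single-linkage dendrogram built by joining clusters in order of increasing edge length.

```python
def minimax_distances(d):
    """Minimax (bottleneck) path distances for a symmetric distance matrix d."""
    n = len(d)
    m = [row[:] for row in d]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                # Largest edge on the path i -> k -> j.
                via_k = max(m[i][k], m[k][j])
                if via_k < m[i][j]:
                    m[i][j] = via_k
    return m


def single_linkage_heights(d):
    """Height at which each pair is first merged in a single-linkage dendrogram."""
    n = len(d)
    cluster_of = list(range(n))
    members = {i: [i] for i in range(n)}
    height = [[0] * n for _ in range(n)]
    # Process pairwise distances in increasing order (Kruskal-style merging).
    edges = sorted((d[i][j], i, j) for i in range(n) for j in range(i + 1, n))
    for h, i, j in edges:
        ci, cj = cluster_of[i], cluster_of[j]
        if ci == cj:
            continue
        for a in members[ci]:
            for b in members[cj]:
                height[a][b] = height[b][a] = h
        for b in members[cj]:
            cluster_of[b] = ci
        members[ci].extend(members.pop(cj))
    return height


# Hypothetical data: four points on a line at positions 0, 1, 3 and 7.
positions = [0, 1, 3, 7]
d = [[abs(a - b) for b in positions] for a in positions]

print(minimax_distances(d) == single_linkage_heights(d))
```

On this example the two matrices coincide, which is the single-linkage special case; other linkage criteria or level functions would generally give different distances.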
Functional characteristics of the calcium modulated proteins seen from an evolutionary perspective
We have constructed dendrograms relating 173 EF-hand proteins of known amino acid sequence. We aligned all of these proteins by their EF-hand domains, omitting interdomain regions. Initial dendrograms were computed by minimum mutation distance methods. Using these as starting points, we determined the best dendrogram by the method of maximum parsimony, scored by minimum mutation distance. We identified 14 distinct subfamilies as well as 6 unique proteins that are perhaps the sole representatives of other subfamilies. This information is given in tabular form. Within subfamilies one can easily align interdomain regions. The resulting dendrograms are very similar to those computed using domains only. Dendrograms constructed using pairs of domains show general congruence. However, there are enough exceptions to caution against an overly simple scheme in which one pair of gene duplications leads from one domain precursor to a four-domain prototype from which all other forms evolved. The ability to bind calcium was lost and acquired several times during evolution. The distribution of introns does not conform to the dendrogram based on amino acid sequences. The rates of evolution appear to be much slower within subfamilies, especially within calmodulin, than those prior to the definition of subfamily.
Measuring Global Similarity between Texts
We propose a new similarity measure between texts which, contrary to the
current state-of-the-art approaches, takes a global view of the texts to be
compared. We have implemented a tool to compute our textual distance and
conducted experiments on several corpora of texts. The experiments show that
our methods can reliably identify different global types of texts.
Comment: Submitted to SLSP 201