The similarity metric
A new class of distances appropriate for measuring similarity relations
between sequences, say one type of similarity per distance, is studied. We
propose a new "normalized information distance", based on the noncomputable
notion of Kolmogorov complexity, and show that it is in this class and that it
minorizes every computable distance in the class (that is, it is universal in
that it discovers all computable similarities). We demonstrate that it is a
metric and call it the "similarity metric". This theory forms the
foundation for a new practical tool. To evidence generality and robustness we
give two distinctive applications in widely divergent areas using standard
compression programs like gzip and GenCompress. First, we compare whole
mitochondrial genomes and infer their evolutionary history. This yields the
first completely automatically computed whole mitochondrial phylogeny tree.
Secondly, we fully automatically compute the language tree of 52 different
languages.
Comment: 13 pages, LaTeX, 5 figures. Part of this work appeared in Proc. 14th
ACM-SIAM Symp. Discrete Algorithms, 2003. This is the final, corrected
version to appear in IEEE Trans. Inform. Theory.
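For reference, the normalized information distance at the core of this paper has the standard form from the literature, stated in terms of the Kolmogorov complexity K (a restatement, not a quotation from the abstract):

```latex
\mathrm{NID}(x,y) = \frac{\max\{K(x \mid y),\, K(y \mid x)\}}{\max\{K(x),\, K(y)\}}
```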
Effects of constraint curvature on structural instability: tensile buckling and multiple bifurcations
Bifurcation of an elastic structure crucially depends on the curvature of the
constraints against which the ends of the structure are prescribed to move, an
effect which deserves more attention than it has received so far. In fact, we
show theoretically and we provide definitive experimental verification that an
appropriate curvature of the constraint over which the end of a structure has
to slide strongly affects buckling loads and can induce: (i.) tensile buckling;
(ii.) decreasing- (softening), increasing- (hardening), or constant-load (null
stiffness) postcritical behaviour; (iii.) multiple bifurcations, determining
for instance two bifurcation loads (one tensile and one compressive) in a
single-degree-of-freedom elastic system. We show how to design a constraint
profile to obtain a desired postcritical behaviour and we provide the solution
for the elastica constrained to slide along a circle on one end, representing
the first example of an inflexional elastica developed from a buckling in
tension. These results have important practical implications in the design of
compliant mechanisms and may find applications in devices operating in
quasi-static or dynamic conditions.
Towards Automated Boundary Value Testing with Program Derivatives and Search
A natural and often used strategy when testing software is to use input
values at boundaries, i.e. where behavior is expected to change the most, an
approach often called boundary value testing or analysis (BVA). Even though
this has long been a key testing idea, it has been hard to define and
formalize clearly. Consequently, it has also been hard to automate.
In this research note we propose one such formalization of BVA: by analogy
with how the derivative of a function is defined in mathematics, we consider
(software) program derivatives. Critical to our definition is the notion of
distance between inputs and outputs, which we formalize and then quantify
using ideas from information theory.
However, for our (black-box) approach to be practical one must search for
test inputs with specific properties. Coupling it with search-based software
engineering is thus required and we discuss how program derivatives can be used
as and within fitness functions.
This brief note does not allow a deeper, empirical investigation but we use a
simple illustrative example throughout to introduce the main ideas. By
combining program derivatives with search, we thus propose a practical as well
as theoretically interesting technique for automated boundary value (analysis
and) testing.
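A minimal sketch of the idea, under stated assumptions: the distance function, the toy program, and the step size are illustrative choices of ours, not the note's actual formalization (which derives distances from information theory). A program derivative is approximated as the ratio of output distance to input distance, and that ratio serves as a fitness function for a simple random search:

```python
import random

def dist(a: float, b: float) -> float:
    """Illustrative distance between numeric values (an assumption; the
    note quantifies distances using information theory)."""
    return abs(a - b)

def program_derivative(program, x: float, dx: float = 1.0) -> float:
    """Ratio of output distance to input distance around x. Large values
    suggest x lies near a behavioral boundary. The step dx is an
    illustrative choice."""
    return dist(program(x + dx), program(x)) / dx

def classify(x: float) -> float:
    # Toy program under test: behavior changes at the boundary x = 100.
    return 0.0 if x < 100 else 1.0

def search_boundary(program, lo: float, hi: float, iters: int = 2000) -> float:
    """Random search using the program derivative as the fitness function."""
    best_x, best_fit = lo, -1.0
    for _ in range(iters):
        x = random.uniform(lo, hi)
        fit = program_derivative(program, x)
        if fit > best_fit:
            best_x, best_fit = x, fit
    return best_x

random.seed(0)
best = search_boundary(classify, 0.0, 200.0)
# best lands just below the behavioral boundary at x = 100.
```

A search-based testing tool would replace the random sampler with a proper metaheuristic and the toy distances with the information-theoretic ones the note proposes.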
Normalized Web Distance and Word Similarity
There is a great deal of work in cognitive psychology, linguistics, and
computer science, about using word (or phrase) frequencies in context in text
corpora to develop measures for word similarity or word association, going back
to at least the 1960s. The goal of this chapter is to introduce the
normalized web distance (NWD) method to determine similarity between words
and phrases. It is a general way to tap the amorphous low-grade knowledge
available for free on the Internet, typed in by local users aiming at personal
gratification of diverse objectives, and yet globally achieving what is
effectively the largest semantic electronic database in the world. Moreover,
this database is available for all by using any search engine that can return
aggregate page-count estimates for a large range of search queries. In the
paper introducing the NWD it was called the "normalized Google distance
(NGD)," but since Google doesn't allow computer searches anymore, we opt for
the more neutral and descriptive NWD.
Comment: LaTeX, 20 pages, 7 figures, to appear in: Handbook of Natural
Language Processing, Second Edition, Nitin Indurkhya and Fred J. Damerau,
Eds., CRC Press, Taylor and Francis Group, Boca Raton, FL, 2010, ISBN
978-142008592
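The NWD computation itself is a short formula over page counts. The sketch below restates the standard NWD formula from the literature; the page counts plugged in are hypothetical, for illustration only:

```python
import math

def nwd(fx: float, fy: float, fxy: float, n: float) -> float:
    """Normalized web distance from aggregate page counts: fx and fy are
    the counts for each term alone, fxy the count for pages containing
    both terms, and n the total number of indexed pages."""
    lx, ly, lxy, ln = math.log(fx), math.log(fy), math.log(fxy), math.log(n)
    return (max(lx, ly) - lxy) / (ln - min(lx, ly))

# Hypothetical page counts, for illustration only.
d = nwd(fx=8_000_000, fy=5_000_000, fxy=3_000_000, n=50_000_000_000)
```

Terms that always co-occur get distance 0; terms that never co-occur get a large distance, since the log of the joint count drops far below the individual counts.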
Clustering by compression
We present a new method for clustering based on compression. The method
doesn't use subject-specific features or background knowledge, and works as
follows: First, we determine a universal similarity distance, the normalized
compression distance or NCD, computed from the lengths of compressed data files
(singly and in pairwise concatenation). Second, we apply a hierarchical
clustering method. The NCD is universal in that it is not restricted to a
specific application area, and works across application area boundaries. A
theoretical precursor, the normalized information distance, co-developed by one
of the authors, is provably optimal but uses the non-computable notion of
Kolmogorov complexity. We propose precise notions of a similarity metric and a
normal compressor, and show that the NCD based on a normal compressor is a similarity
metric that approximates universality. To extract a hierarchy of clusters from
the distance matrix, we determine a dendrogram (binary tree) by a new quartet
method and a fast heuristic to implement it. The method is implemented and
available as public software, and is robust under choice of different
compressors. To substantiate our claims of universality and robustness, we
report evidence of successful application in areas as diverse as genomics,
virology, languages, literature, music, handwritten digits, astronomy, and
combinations of objects from completely different domains, using statistical,
dictionary, and block-sorting compressors. In genomics we present new
evidence on major questions in mammalian evolution, based on
whole-mitochondrial genomic analysis: the Eutherian orders and the Marsupionta
hypothesis versus the Theria hypothesis.
Comment: LaTeX, 27 pages, 20 figures
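The NCD described above can be sketched in a few lines. This is a minimal illustration using gzip as the compressor (a stand-in; the paper also uses statistical and block-sorting compressors), with toy data of our own choosing:

```python
import gzip

def c(data: bytes) -> int:
    """Compressed length of data, using gzip as the 'normal compressor'."""
    return len(gzip.compress(data, compresslevel=9))

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance:
    NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))."""
    cx, cy = c(x), c(y)
    return (c(x + y) - min(cx, cy)) / max(cx, cy)

# Toy data: two related texts and one unrelated byte sequence.
a = b"the quick brown fox jumps over the lazy dog " * 20
b2 = b"the quick brown fox jumps over the lazy cat " * 20
z = bytes(range(256)) * 4

d_close = ncd(a, b2)
d_far = ncd(a, z)
# Related strings share compressible structure, so d_close < d_far.
```

Feeding the resulting pairwise distance matrix to a hierarchical clustering method then yields the dendrograms the paper reports.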
Information Distance: New Developments
In pattern recognition, learning, and data mining one obtains information
from information-carrying objects. This involves an objective definition of the
information in a single object, the information to go from one object to
another object in a pair of objects, the information to go from one object to
any other object in a multiple of objects, and the shared information between
objects. This is called "information distance." We survey a selection of new
developments in information distance.
Comment: 4 pages, LaTeX; Series of Publications C, Report C-2011-45,
Department of Computer Science, University of Helsinki, pp. 71-7
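For reference, the information distance surveyed here is standardly defined, up to an additive logarithmic term, as the length of a shortest program converting either object into the other (a restatement of the standard formula, not a quotation from this note):

```latex
E(x,y) = \max\{K(x \mid y),\; K(y \mid x)\}
```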