On Empirical Entropy
We propose a compression-based version of the empirical entropy of a finite
string over a finite alphabet. Whereas previously one considered the naked
entropy of (possibly higher-order) Markov processes, we consider the sum of
the description length of the random variable involved and the entropy it
induces. We assume only that the distribution involved is computable. To test
the new notion we compare the Normalized Information Distance (the similarity
metric) with a related measure based on Mutual Information in Shannon's
framework. In this way the similarities and differences between the last two
concepts are exposed.
Comment: 14 pages, LaTeX
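To make the flavor of the two-part proposal concrete, here is a minimal sketch, assuming a crude MDL-style accounting that is our illustration and not the paper's definition: charge for describing an order-k Markov model of the string, and add the entropy that model induces on the string.

```python
import math
from collections import Counter

def empirical_entropy_bits(s: str, k: int) -> float:
    """Order-k empirical (conditional) entropy of s, in bits:
    sum over contexts w of n_{wc} * log2(n_w / n_{wc})."""
    ctx_counts = Counter()
    pair_counts = Counter()
    for i in range(k, len(s)):
        w, c = s[i - k:i], s[i]
        ctx_counts[w] += 1
        pair_counts[(w, c)] += 1
    h = 0.0
    for (w, c), n in pair_counts.items():
        h += -n * math.log2(n / ctx_counts[w])
    return h

def two_part_entropy_bits(s: str, k: int, alphabet_size: int) -> float:
    """Hypothetical two-part cost: a description cost for the order-k
    model's parameters plus the entropy the model induces on s.  The
    (1/2) log n charge per parameter is a standard MDL convention,
    assumed here for illustration, not the paper's exact definition."""
    n_params = alphabet_size ** k * (alphabet_size - 1)
    model_bits = 0.5 * n_params * math.log2(max(len(s), 2))
    return model_bits + empirical_entropy_bits(s, k)

print(two_part_entropy_bits("abababababababab", 1, 2))
```

A highly regular string like the one above gets almost all of its cost from the model description, while the entropy term dominates for strings the model compresses poorly.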
Quantum Kolmogorov Complexity Based on Classical Descriptions
We develop a theory of the algorithmic information in bits contained in an
individual pure quantum state. This extends classical Kolmogorov complexity to
the quantum domain while retaining classical descriptions. Quantum Kolmogorov
complexity coincides with classical Kolmogorov complexity on the classical
domain. Quantum Kolmogorov complexity is upper bounded and can be effectively
approximated from above under certain conditions. With high probability a
quantum object is incompressible. Upper and lower bounds on the quantum
complexity of multiple copies of individual pure quantum states are derived
and may shed some light on the no-cloning properties of quantum states. In
the quantum situation complexity is not subadditive. We discuss some
relations with the ``no-cloning'' and ``approximate cloning'' properties.
Comment: 17 pages, LaTeX; final and extended version of quant-ph/9907035,
with corrections to the published journal version (the two displayed
equations in the right-hand column on page 2466 had the left-hand sides of
the displayed formulas erroneously interchanged).
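For orientation, the central quantity is, as we recall it (the exact conventions are in the paper itself), the minimal length of a classical program that makes a universal quantum Turing machine $U$ output an approximation of the target state, penalized by the approximation fidelity:

\[
  K(|x\rangle) \;=\; \min_{p,\,|z\rangle} \bigl\{\, \ell(p) + \lceil -\log |\langle z \mid x \rangle|^{2} \rceil \;:\; U(p) = |z\rangle \,\bigr\},
\]

so a short program producing a good approximation $|z\rangle$ of $|x\rangle$ yields low complexity, and when the output equals a classical string exactly the penalty term vanishes, recovering the classical case.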
The Google Similarity Distance
Words and phrases acquire meaning from the way they are used in society, from
their relative semantics to other words and phrases. For computers the
equivalent of `society' is `database,' and the equivalent of `use' is `way to
search the database.' We present a new theory of similarity between words and
phrases based on information distance and Kolmogorov complexity. To fix
thoughts we use the world-wide-web as database, and Google as search engine.
The method is also applicable to other search engines and databases. This
theory is then applied to construct a method to automatically extract
similarity, the Google similarity distance, of words and phrases from the
world-wide-web using Google page counts. The world-wide-web is the largest
database on earth, and the context information entered by millions of
independent users averages out to provide automatic semantics of useful
quality. We give applications in hierarchical clustering, classification, and
language translation. We give examples of distinguishing colors from numbers,
of clustering names of paintings by 17th century Dutch masters and names of
books by English novelists, of understanding emergencies, and of recognizing
primes; we also demonstrate a simple automatic English-Spanish translation.
Finally, we use the WordNet database as an objective baseline against which
to judge the performance of our method. We conduct a massive randomized trial
in binary classification using support vector machines to learn categories
based on our Google distance, resulting in a mean agreement of 87% with the
expert-crafted WordNet categories.
Comment: 15 pages, 10 figures; changed some text/figures/notation/part of a
theorem. Incorporated referees' comments. This is the final published version
up to some minor changes in the galley proof.
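For reference, the normalized Google distance between terms x and y with page counts f(x), f(y), joint count f(x,y), and index size N is NGD(x,y) = (max{log f(x), log f(y)} - log f(x,y)) / (log N - min{log f(x), log f(y)}). A minimal sketch follows; the counts in the example are made up, and a real implementation would obtain them from a search engine.

```python
import math

def ngd(fx: float, fy: float, fxy: float, n: float) -> float:
    """Normalized Google distance from raw page counts.
    fx, fy: counts for the individual terms; fxy: count for the
    conjunctive query; n: (an estimate of) the number of indexed pages."""
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))

# Illustrative, made-up counts: terms that co-occur often give small NGD.
print(ngd(fx=9_000_000, fy=8_000_000, fxy=6_500_000, n=8_000_000_000))
```

The distance is 0 when the two terms always co-occur and grows as their co-occurrence becomes rarer relative to their individual frequencies.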
Algorithmic Identification of Probabilities
The problem is to identify a probability associated with a set of natural
numbers, given an infinite data sequence of elements from the set. If the given
sequence is drawn i.i.d. and the probability mass function involved (the
target) belongs to a computably enumerable (c.e.) or co-computably enumerable
(co-c.e.) set of computable probability mass functions, then there is an
algorithm to almost surely identify the target in the limit. The technical tool
is the strong law of large numbers. If the set is finite and the elements of
the sequence are dependent while the sequence is typical in the sense of
Martin-L\"of for at least one measure belonging to a c.e. or co-c.e. set of
computable measures, then there is an algorithm to identify in the limit a
computable measure for which the sequence is typical (there may be more than
one such measure). The technical tool is the theory of Kolmogorov complexity.
We give the algorithms and consider the associated predictions.
Comment: 19 pages, LaTeX. Corrected errors and rewrote the entire paper.
arXiv admin note: text overlap with arXiv:1208.500
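A toy sketch of the finite-candidate i.i.d. case, our illustration rather than the paper's algorithm: at each stage, guess the first candidate pmf whose probabilities stay within a slowly shrinking tolerance of the empirical frequencies. By the strong law of large numbers, the empirical frequencies converge to the target almost surely, so from some point on the guess stabilizes on a correct hypothesis. The tolerance schedule eps_t = t**-0.25 below is an arbitrary choice that suffices for this toy version.

```python
import random
from collections import Counter
from typing import Dict, List, Sequence

# A candidate hypothesis is a pmf on a finite set: {outcome: probability}.
Pmf = Dict[int, float]

def identify_in_limit(candidates: List[Pmf], data: Sequence[int]) -> Pmf:
    """After each prefix data[:t], guess the first candidate whose
    probabilities all lie within eps_t of the empirical frequencies.
    Returns the guess after the last observation."""
    counts: Counter = Counter()
    guess = candidates[0]
    for t, x in enumerate(data, start=1):
        counts[x] += 1
        eps = t ** -0.25
        for cand in candidates:
            outcomes = set(cand) | set(counts)
            if all(abs(cand.get(o, 0.0) - counts[o] / t) <= eps
                   for o in outcomes):
                guess = cand
                break
    return guess

# Toy run: i.i.d. data drawn from the second candidate.
random.seed(0)
cands = [{0: 0.5, 1: 0.5}, {0: 0.2, 1: 0.8}]
data = [0 if random.random() < 0.2 else 1 for _ in range(10_000)]
print(identify_in_limit(cands, data))  # settles on {0: 0.2, 1: 0.8}
```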
A New Quartet Tree Heuristic for Hierarchical Clustering
We consider the problem of constructing an optimal-weight tree from the
3*(n choose 4) weighted quartet topologies on n objects, where optimality
means that the summed weight of the embedded quartet topologies is optimal
(so it can be the case that the optimal tree embeds all quartets as
non-optimal topologies). We present a heuristic for reconstructing the
optimal-weight tree, and a canonical manner to derive the quartet-topology
weights from a given distance matrix. The method repeatedly transforms a
bifurcating tree, with all objects involved as leaves, achieving a monotonic
approximation to the exact single globally optimal tree. This contrasts with
other heuristic search methods from biological phylogeny, like DNAML or
quartet puzzling, which repeatedly and incrementally construct a solution
from a random order of objects, and subsequently add agreement values.
Comment: 22 pages, 14 figures
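The derivation of quartet-topology weights invites a small illustration. A natural cost for the topology uv|wx under a distance matrix d is d(u,v) + d(w,x); we state this as an assumption, since the abstract only says the derivation from the distance matrix is canonical.

```python
from itertools import combinations

def quartet_costs(d):
    """For each 4-set {u, v, w, x} of leaf indices, return the three
    possible quartet topologies with the assumed cost of each, where
    topology uv|wx (u pairs with v, w pairs with x) is charged
    d[u][v] + d[w][x].  d is a symmetric distance matrix."""
    costs = {}
    for u, v, w, x in combinations(range(len(d)), 4):
        costs[(u, v, w, x)] = {
            ((u, v), (w, x)): d[u][v] + d[w][x],
            ((u, w), (v, x)): d[u][w] + d[v][x],
            ((u, x), (v, w)): d[u][x] + d[v][w],
        }
    return costs

# Four points on a line: the topology 01|23 should be the cheapest.
d = [[0, 1, 4, 5],
     [1, 0, 3, 4],
     [4, 3, 0, 1],
     [5, 4, 1, 0]]
print(quartet_costs(d)[(0, 1, 2, 3)])
```

A hill-climbing heuristic in this spirit would then search over bifurcating trees for one minimizing (or, with weights, maximizing) the total over all embedded quartet topologies.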