Normalized Web Distance and Word Similarity
There is a great deal of work in cognitive psychology, linguistics, and
computer science, about using word (or phrase) frequencies in context in text
corpora to develop measures for word similarity or word association, going back
to at least the 1960s. The goal of this chapter is to introduce the
normalized web distance (NWD) method to determine similarity
between words and phrases. It is a general way to tap the amorphous low-grade
knowledge available for free on the Internet, typed in by local users aiming at
personal gratification of diverse objectives, and yet globally achieving what
is effectively the largest semantic electronic database in the world. Moreover,
this database is available for all by using any search engine that can return
aggregate page-count estimates for a large range of search-queries. In the
paper introducing the NWD it was called `normalized Google distance (NGD),' but
since Google doesn't allow computer searches anymore, we opt for the more
neutral and descriptive NWD.
Comment: LaTeX, 20 pages, 7 figures, to appear in: Handbook of Natural
Language Processing, Second Edition, Nitin Indurkhya and Fred J. Damerau
Eds., CRC Press, Taylor and Francis Group, Boca Raton, FL, 2010, ISBN
978-142008592
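
The NWD itself is computed purely from aggregate page counts. Below is a
minimal Python sketch of that computation, following the NGD/NWD formula from
the chapter; page_count is a hypothetical callback supplied by whatever search
backend is used, and n_total is the estimated total number of indexed pages.

    from math import log

    def nwd(x, y, page_count, n_total):
        # page_count(query): number of pages containing the query
        # (hypothetical backend supplied by the caller).
        # n_total: estimated total number of indexed pages.
        fx, fy = page_count(x), page_count(y)
        fxy = page_count(x + " " + y)      # pages containing both terms
        if min(fx, fy, fxy) == 0:
            return float("inf")            # no usable frequency information
        lx, ly, lxy = log(fx), log(fy), log(fxy)
        return (max(lx, ly) - lxy) / (log(n_total) - min(lx, ly))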
Uniform test of algorithmic randomness over a general space
The algorithmic theory of randomness is well developed when the underlying
space is the set of finite or infinite sequences and the underlying probability
distribution is the uniform distribution or a computable distribution. These
restrictions seem artificial. Some progress has been made to extend the theory
to arbitrary Bernoulli distributions (by Martin-Loef), and to arbitrary
distributions (by Levin). We recall the main ideas and problems of Levin's
theory, and report further progress in the same framework.
- We allow non-compact spaces (like the space of continuous functions,
underlying the Brownian motion).
- The uniform test (deficiency of randomness) d_P(x) (depending both on the
outcome x and the measure P) should be defined in a general and natural way.
- We see which of the old results survive: existence of universal tests,
conservation of randomness, expression of tests in terms of description
complexity, existence of a universal measure, expression of mutual information
as "deficiency of independence.
- The negative of the new randomness test is shown to be a generalization of
complexity in continuous spaces; we show that the addition theorem survives.
The paper's main contribution is introducing an appropriate framework for
studying these questions and related ones (like statistics for a general family
of distributions).
Comment: 40 pages. Journal reference and a slight correction in the proof of
Theorem 7 added.
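
As background for the item on expressing tests in terms of description
complexity: in the classical discrete setting (a computable probability
measure P on finite binary strings), the deficiency of randomness has the
standard characterization below, where K denotes prefix complexity and the
equality holds up to an additive constant; whether and how this carries over
to the general setting is exactly what the paper examines.

    d_P(x) \stackrel{+}{=} -\log P(x) - K(x \mid P)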
Information Distance: New Developments
In pattern recognition, learning, and data mining one obtains information
from information-carrying objects. This involves an objective definition of the
information in a single object, the information needed to go from one object to
the other in a pair of objects, the information needed to go from one object to
any other object among several objects, and the information shared between
objects. This is called "information distance." We survey a selection of new
developments in information distance.
Comment: 4 pages, LaTeX; Series of Publications C, Report C-2011-45, Department
of Computer Science, University of Helsinki, pp. 71-7
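
The information distance between individual objects is defined via Kolmogorov
complexity and is therefore uncomputable; one well-known development in this
area approximates it with real-world compressors as the normalized compression
distance (NCD). The Python sketch below illustrates that approximation with
gzip as a crude stand-in for the ideal compressor; it is illustrative
background, not code from the report.

    import gzip

    def compressed_len(data):
        # Length in bytes of the gzip-compressed data, used as a rough
        # proxy for Kolmogorov complexity.
        return len(gzip.compress(data))

    def ncd(x, y):
        # Normalized compression distance: a computable approximation of
        # the (uncomputable) normalized information distance.
        cx, cy, cxy = compressed_len(x), compressed_len(y), compressed_len(x + y)
        return (cxy - min(cx, cy)) / max(cx, cy)

    # Related strings should come out closer than unrelated ones.
    a = b"the quick brown fox jumps over the lazy dog " * 20
    b = b"the quick brown fox leaps over the lazy dog " * 20
    c = b"lorem ipsum dolor sit amet consectetur adipiscing " * 20
    print(ncd(a, b), ncd(a, c))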
Asymptotics of Discrete MDL for Online Prediction
Minimum Description Length (MDL) is an important principle for induction and
prediction, with strong relations to optimal Bayesian learning. This paper
deals with learning non-i.i.d. processes by means of two-part MDL, where the
underlying model class is countable. We consider the online learning framework,
i.e. observations come in one by one, and the predictor is allowed to update
his state of mind after each time step. We identify two ways of predicting by
MDL for this setup, namely a static and a dynamic one. (A third variant,
hybrid MDL, will turn out inferior.) We will prove that under the only
assumption that the data is generated by a distribution contained in the model
class, the MDL predictions converge to the true values almost surely. This is
accomplished by proving finite bounds on the quadratic, the Hellinger, and the
Kullback-Leibler loss of the MDL learner, which are however exponentially worse
than for Bayesian prediction. We demonstrate that these bounds are sharp, even
for model classes containing only Bernoulli distributions. We show how these
bounds imply regret bounds for arbitrary loss functions. Our results apply to a
wide range of setups, namely sequence prediction, pattern classification,
regression, and universal induction in the sense of Algorithmic Information
Theory, among others.
Comment: 34 pages
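
As a rough illustration of two-part MDL prediction in the online setting
described above, the Python sketch below uses a small, purely illustrative
Bernoulli model class (a finite slice of a countable class) with an arbitrary
prefix-code length assigned to each model; after each observation it
re-selects the model minimizing the two-part code length and predicts with it.
The specific class and code lengths are assumptions made for the example, not
the paper's construction.

    from math import log2

    # Illustrative model class: Bernoulli(theta) with theta = 0.1, ..., 0.9,
    # a small finite slice of a countable class. The i-th model is assigned
    # an (arbitrary) prefix-code length of i + 1 bits.
    THETAS = [k / 10 for k in range(1, 10)]
    MODEL_BITS = [i + 1 for i in range(len(THETAS))]

    def data_bits(theta, data):
        # -log2 P(data | theta) for i.i.d. Bernoulli observations in {0, 1}.
        return sum(-log2(theta) if x == 1 else -log2(1.0 - theta) for x in data)

    def mdl_predict(data):
        # Two-part MDL: choose the model minimizing code length of model
        # plus data, then output its probability that the next symbol is 1.
        best = min(range(len(THETAS)),
                   key=lambda i: MODEL_BITS[i] + data_bits(THETAS[i], data))
        return THETAS[best]

    # Online use: the selected model may change after every new observation.
    observations = [1, 1, 0, 1, 1, 1, 0, 1]
    for t in range(1, len(observations) + 1):
        print(t, mdl_predict(observations[:t]))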