139 research outputs found
A Natural Law of Succession
Consider the problem of multinomial estimation. You are given an alphabet of
k distinct symbols and are told that the i-th symbol occurred exactly n_i times
in the past. On the basis of this information alone, you must now estimate the
conditional probability that the next symbol will be i. In this report, we
present a new solution to this fundamental problem in statistics and
demonstrate that our solution outperforms standard approaches, both in theory
and in practice.Comment: 23 page
Nonuniform Markov models
A statistical language model assigns probability to strings of arbitrary
length. Unfortunately, it is not possible to gather reliable statistics on
strings of arbitrary length from a finite corpus. Therefore, a statistical
language model must decide that each symbol in a string depends on at most a
small, finite number of other symbols in the string. In this report we propose
a new way to model conditional independence in Markov models. The central
feature of our nonuniform Markov model is that it makes predictions of varying
lengths using contexts of varying lengths. Experiments on the Wall Street
Journal reveal that the nonuniform model performs slightly better than the
classic interpolated Markov model. This result is somewhat remarkable because
both models contain identical numbers of parameters whose values are estimated
in a similar manner. The only difference between the two models is how they
combine the statistics of longer and shorter strings.
Keywords: nonuniform Markov model, interpolated Markov model, conditional
independence, statistical language model, discrete time series.Comment: 17 page
Learning Stochastic Tree Edit Distance
pages 42-53International audienceTrees provide a suited structural representation to deal with complex tasks such as web information extraction, RNA secondary structure prediction, or conversion of tree structured documents. In this context, many applications require the calculation of similarities between tree pairs. The most studied distance is likely the tree edit distance for which improvements in terms of complexity have been achieved during the last decade. However, this classic edit distance usually uses a priori fixed edit costs which are often difficult to tune, that leaves little room for tackling complex problems. In this paper, we focus on the learning of a stochastic tree edit distance. We use an adaptation of the expectation-maximization algorithm for learning the primitive edit costs. We carried out several series of experiments that confirm the interest to learn a tree edit distance rather than a priori imposing edit costs
- …