Computing the entropy of user navigation in the web
Navigation through the web, colloquially known as "surfing", is one of the main activities of users during web interaction. When users follow a navigation trail they often tend to get disoriented in terms of the goals of their original query, and thus the discovery of typical user trails could be useful in providing navigation assistance. Herein, we give a theoretical underpinning of user navigation in terms of the entropy of an underlying Markov chain modelling the web topology. We present a novel method for online incremental computation of the entropy, and a large deviation result regarding the length of a trail needed to realize the said entropy. We provide an error analysis for our estimation of the entropy in terms of the divergence between the empirical and actual probabilities. We then indicate applications of our algorithm in the area of web data mining. Finally, we present an extension of our technique to higher-order Markov chains by a suitable reduction of a higher-order Markov chain model to a first-order one.
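The entropy in question is the entropy rate of the Markov chain modelling the web topology: the stationary distribution weighting the per-state transition entropies. The abstract's method is online and incremental; the following is only a minimal batch sketch (the function name and the numpy eigen-decomposition approach are illustrative, not the paper's algorithm):

```python
import numpy as np

def entropy_rate(P):
    """Entropy rate H = sum_i pi_i * H(P[i, :]) in bits per step for an
    ergodic Markov chain with row-stochastic transition matrix P."""
    # Stationary distribution pi: left eigenvector of P for eigenvalue 1.
    vals, vecs = np.linalg.eig(P.T)
    pi = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
    pi = pi / pi.sum()
    # Per-state transition entropies, with the convention 0 * log 0 = 0.
    logP = np.log2(np.where(P > 0, P, 1.0))
    row_H = -np.sum(P * logP, axis=1)
    return float(pi @ row_H)

# Toy two-page web: page 0 mostly self-links, page 1 is a coin flip.
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])
print(entropy_rate(P))  # about 0.56 bits per step
```

A trail drawn from the chain of length n then has probability roughly 2^{-nH}, which is the sense in which a trail "realizes" the entropy.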
Scaling laws for learning high-dimensional Markov forest distributions
The problem of learning forest-structured discrete graphical models from i.i.d. samples is considered. An algorithm based on pruning of the Chow-Liu tree through adaptive thresholding is proposed. It is shown that this algorithm is structurally consistent and the error probability of structure learning decays faster than any polynomial in the number of samples under fixed model size. For the high-dimensional scenario where the size of the model d and the number of edges k scale with the number of samples n, sufficient conditions on (n, d, k) are given for the algorithm to be structurally consistent. In addition, the extremal structures for learning are identified; we prove that the independent (resp. tree) model is the hardest (resp. easiest) to learn using the proposed algorithm in terms of error rates for structure learning.
Funding: United States. Air Force Office of Scientific Research (Grant FA9559-08-1-1080); United States. Army Research Office. Multidisciplinary University Research Initiative (Grant W911NF-06-1-0076); United States. Army Research Office. Multidisciplinary University Research Initiative (Grant FA9550-06-1-0324); Singapore. Agency for Science, Technology and Research.
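The Chow-Liu construction referenced above picks a maximum-weight spanning structure over pairwise empirical mutual informations; discarding edges whose weight falls below a threshold yields a forest. A minimal sketch with a fixed threshold (the paper's threshold is adaptive, and all names here are illustrative, not the authors' implementation):

```python
import numpy as np
from itertools import combinations

def empirical_mi(x, y):
    """Plug-in estimate of the mutual information (bits) between two
    discrete sample vectors x and y."""
    mi = 0.0
    for a in np.unique(x):
        for b in np.unique(y):
            pxy = np.mean((x == a) & (y == b))
            if pxy > 0:
                mi += pxy * np.log2(pxy / (np.mean(x == a) * np.mean(y == b)))
    return mi

def chow_liu_forest(data, threshold):
    """Kruskal-style maximum-weight spanning forest over pairwise MI,
    keeping only edges whose MI exceeds `threshold` (the pruning step)."""
    d = data.shape[1]
    edges = sorted(
        ((empirical_mi(data[:, i], data[:, j]), i, j)
         for i, j in combinations(range(d), 2)),
        reverse=True)
    parent = list(range(d))          # union-find over the d variables
    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u
    forest = []
    for w, i, j in edges:
        if w <= threshold:
            break                    # remaining edges are weaker: prune them all
        ri, rj = find(i), find(j)
        if ri != rj:                 # adding the edge keeps the graph acyclic
            parent[ri] = rj
            forest.append((i, j, w))
    return forest
```

On data where one column duplicates another and a third is unrelated, only the edge between the duplicated columns survives the threshold, illustrating how thresholding turns the Chow-Liu tree into a forest.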
Neighborhood radius estimation in Variable-neighborhood Random Fields
We consider random fields defined by finite-region conditional probabilities depending on a neighborhood of the region which changes with the boundary conditions. To predict the symbols within any finite region it is necessary to inspect a random number of neighborhood symbols, which may change depending on their values. By analogy with the one-dimensional setting, we call these neighborhood symbols the context of the region. This framework is a natural extension, to d-dimensional fields, of the notion of variable-length Markov chains introduced by Rissanen (1983) in his classical paper. We define an algorithm to estimate the radius of the smallest ball containing the context, based on a realization of the field. We prove the consistency of this estimator. Our proofs are constructive and yield explicit upper bounds on the probability of wrongly estimating the radius of the context.
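In Rissanen's one-dimensional variable-length Markov chains, the "context" is the suffix of the past actually needed for prediction, and its length varies with the symbols observed. A toy sketch of that lookup (the context set and function are invented for illustration; the paper's d-dimensional radius estimator is considerably more involved):

```python
def find_context(history, contexts):
    """Return the shortest suffix of `history` that belongs to `contexts`,
    scanning from the most recent symbol backwards. In a proper context
    tree the contexts are the leaf labels, so at most one suffix matches."""
    for k in range(1, len(history) + 1):
        suffix = tuple(history[-k:])
        if suffix in contexts:
            return suffix
    return None  # history too short to determine a context

# Toy context set: after a 1, one past symbol suffices; after a 0, two are needed.
contexts = {(1,), (0, 0), (1, 0)}
print(find_context([0, 1, 0, 0], contexts))  # -> (0, 0)
print(find_context([1, 1], contexts))        # -> (1,)
```

The abstract's estimator plays the analogous role in d dimensions: instead of a suffix length, it estimates the radius of the smallest ball of neighborhood symbols that determines the conditional law of the region.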
Learning High-Dimensional Markov Forest Distributions: Analysis of Error Rates
The problem of learning forest-structured discrete graphical models from i.i.d. samples is considered. An algorithm based on pruning of the Chow-Liu tree through adaptive thresholding is proposed. It is shown that this algorithm is both structurally consistent and risk consistent, and that the error probability of structure learning decays faster than any polynomial in the number of samples under fixed model size. For the high-dimensional scenario where the size of the model d and the number of edges k scale with the number of samples n, sufficient conditions on (n, d, k) are given for the algorithm to satisfy structural and risk consistencies. In addition, the extremal structures for learning are identified; we prove that the independent (resp. tree) model is the hardest (resp. easiest) to learn using the proposed algorithm in terms of error rates for structure learning. Comment: Accepted to the Journal of Machine Learning Research (Feb 2011).