Personalised Search Time Prediction using Markov Chains
To improve the effectiveness of Interactive Information Retrieval (IIR), a system should minimise the search time by guiding the user appropriately. As a prerequisite, in any search situation, the system must be able to estimate the time the user will need to find the next relevant document. In this paper, we show how Markov models derived from search logs can be used for predicting search times, and we describe a method for evaluating these predictions. To personalise the predictions based on a few observed user events, we devise appropriate parameter-estimation methods. Our experimental results show that after observing a user for only 100 seconds, the personalised predictions are already significantly better than global predictions.
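The expected-search-time idea can be sketched with a small absorbing Markov chain: given transition probabilities between session states (which the paper estimates from search logs) and mean dwell times, the expected time until the user reaches a relevant document solves a linear system. The states, probabilities, and dwell times below are hypothetical illustrations, not values from the paper:

```python
import numpy as np

# Hypothetical search-session states; "found" is absorbing.
# (Transition probabilities would be estimated from search logs.)
states = ["query", "result_list", "document_view", "found"]
P = np.array([
    [0.10, 0.80, 0.05, 0.05],  # query
    [0.20, 0.30, 0.45, 0.05],  # result_list
    [0.15, 0.35, 0.20, 0.30],  # document_view
    [0.00, 0.00, 0.00, 1.00],  # found (absorbing)
])
dwell = np.array([8.0, 12.0, 25.0])  # mean seconds spent per visit to each transient state

# Expected time to absorption: t = (I - Q)^{-1} dwell,
# where Q is the transient-to-transient block of P.
Q = P[:3, :3]
t = np.linalg.solve(np.eye(3) - Q, dwell)
for s, seconds in zip(states[:3], t):
    print(f"expected search time from {s!r}: {seconds:.1f}s")
```

Personalisation then amounts to re-estimating the entries of P (and the dwell times) from the few events observed for a particular user.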
Maximum Coverage in the Data Stream Model: Parameterized and Generalized
We present algorithms for the Max-Cover and Max-Unique-Cover problems in the data stream model. The input to both problems is a collection of m subsets of a universe of size n and a value k. In Max-Cover, the problem is to find a collection of at most k sets such that the number of elements covered by at least one set is maximized. In Max-Unique-Cover, the problem is to find a collection of at most k sets such that the number of elements covered by exactly one set is maximized. Our goal is to design single-pass algorithms that use space that is sublinear in the input size. Our main algorithmic results are:
If the sets have size at most d, there exist single-pass algorithms, using space that depends only on d and k up to polylogarithmic factors, that solve both problems exactly. This is optimal up to polylogarithmic factors for constant d.
If each element appears in at most r sets, we present single-pass algorithms, using space polynomial in k, r, and 1/ε, that return a 1-ε approximation in the case of Max-Cover. We also present a single-pass algorithm, using slightly more memory, that approximates Max-Unique-Cover.
In contrast to the above results, when d and r are arbitrary, any constant-pass algorithm achieving a near-optimal approximation for either problem requires space nearly linear in m, but a single-pass algorithm using space nearly linear in m exists. In fact, any constant-pass algorithm whose approximation factor beats certain fixed constants for Max-Cover and Max-Unique-Cover respectively requires space that grows polynomially with m when d and r are unrestricted.
En route, we also obtain an algorithm for a parameterized version of the streaming Set-Cover problem.
Comment: Conference version to appear at ICDT 202
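As a rough illustration of the streaming setting, here is a simple single-pass swap heuristic for Max-Cover: it keeps at most k sets and swaps one out whenever the swap strictly increases coverage. This is only a baseline sketch, not the paper's space-efficient algorithms, and it carries no approximation guarantee:

```python
def stream_max_cover(sets_stream, k):
    """Single-pass heuristic for Max-Cover: keep at most k sets,
    swapping one out whenever that strictly increases total coverage.
    (A simple baseline, not the paper's algorithms.)"""
    kept = []

    def coverage(collection):
        return len(set().union(*collection)) if collection else 0

    for s in sets_stream:
        s = frozenset(s)
        if len(kept) < k:
            kept.append(s)
            continue
        base = coverage(kept)
        best_gain, best_i = 0, None
        for i in range(k):
            # Coverage if we replace kept[i] with the arriving set s.
            gain = coverage(kept[:i] + kept[i + 1:] + [s]) - base
            if gain > best_gain:
                best_gain, best_i = gain, i
        if best_i is not None:
            kept[best_i] = s
    return kept

sets = [{1, 2}, {2, 3}, {4, 5, 6}, {1, 2, 3}, {7}]
chosen = stream_max_cover(iter(sets), k=2)
print(sorted(map(sorted, chosen)), len(set().union(*chosen)))
```

Note that the heuristic stores only the k candidate sets themselves, so its memory is independent of the stream length, though not of the set sizes.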
Model-based clustering of large networks
We describe a network clustering framework, based on finite mixture models, that can be applied to discrete-valued networks with hundreds of thousands of nodes and billions of edge variables. Relative to other recent model-based clustering work for networks, we introduce a more flexible modeling framework, improve the variational-approximation estimation algorithm, discuss and implement standard error estimation via a parametric bootstrap approach, and apply these methods to much larger data sets than those seen elsewhere in the literature. The more flexible framework is achieved by introducing novel parameterizations of the model, giving varying degrees of parsimony, using exponential-family models whose structure may be exploited in various theoretical and algorithmic ways. The algorithms are based on variational generalized EM algorithms, where the E-steps are augmented by a minorization-maximization (MM) idea. The bootstrapped standard error estimates are based on an efficient Monte Carlo network simulation idea. Lastly, we demonstrate the usefulness of the model-based clustering framework by applying it to a discrete-valued network with more than 131,000 nodes and 17 billion edge variables.
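The variational-EM idea can be sketched on a toy stochastic block model, a standard special case of such finite mixture models for networks: the E-step updates soft block memberships, and the M-step re-estimates block proportions and edge probabilities. The model, data, and update schedule below are illustrative assumptions, far simpler than the paper's exponential-family framework and MM-augmented E-steps:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 2-block network: dense within blocks, sparse between
# (a toy stand-in for the large discrete-valued networks in the paper).
n, K = 60, 2
z_true = np.repeat([0, 1], n // 2)
probs = np.where(z_true[:, None] == z_true[None, :], 0.25, 0.02)
A = (rng.random((n, n)) < probs).astype(float)
A = np.triu(A, 1); A = A + A.T  # symmetric adjacency, no self-loops

# Variational EM for a Bernoulli stochastic block model.
tau = rng.dirichlet(np.ones(K), size=n)  # soft block memberships
for _ in range(50):
    # M-step: block proportions and block-pair edge probabilities.
    pi = tau.mean(axis=0)
    w = tau.T @ A @ tau                                           # expected edges between blocks
    m = tau.sum(0)[:, None] * tau.sum(0)[None, :] - tau.T @ tau   # expected pairs, i != j
    theta = np.clip(w / np.maximum(m, 1e-9), 1e-6, 1 - 1e-6)
    # E-step (one mean-field sweep): update tau given pi and theta.
    logit = A @ tau @ np.log(theta).T \
        + (1 - A - np.eye(n)) @ tau @ np.log(1 - theta).T
    logtau = np.log(pi) + logit
    logtau -= logtau.max(axis=1, keepdims=True)
    tau = np.exp(logtau)
    tau /= tau.sum(axis=1, keepdims=True)

z_hat = tau.argmax(axis=1)
# Account for label switching when comparing to the truth.
agreement = max(np.mean(z_hat == z_true), np.mean(z_hat != z_true))
print(f"recovered block agreement: {agreement:.2f}")
```

A parametric bootstrap, as in the paper, would refit this procedure to networks simulated from the fitted (pi, theta) to obtain standard errors.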
R-VGAL: A Sequential Variational Bayes Algorithm for Generalised Linear Mixed Models
Models with random effects, such as generalised linear mixed models (GLMMs),
are often used for analysing clustered data. Parameter inference with these
models is difficult because of the presence of cluster-specific random effects,
which must be integrated out when evaluating the likelihood function. Here, we
propose a sequential variational Bayes algorithm, called Recursive Variational
Gaussian Approximation for Latent variable models (R-VGAL), for estimating
parameters in GLMMs. The R-VGAL algorithm operates on the data sequentially,
requires only a single pass through the data, and can provide parameter updates
as new data are collected, without the need to re-process previous data.
At each update, the R-VGAL algorithm requires the gradient and Hessian of a
"partial" log-likelihood function evaluated at the new observation, which are
generally not available in closed form for GLMMs. To circumvent this issue, we
propose using an importance-sampling-based approach for estimating the gradient
and Hessian via Fisher's and Louis' identities. We find that R-VGAL can be
unstable when traversing the first few data points, but that this issue can be
mitigated by using a variant of variational tempering in the initial steps of
the algorithm. Through illustrations on both simulated and real datasets, we
show that R-VGAL provides good approximations to the exact posterior
distributions, that it can be made robust through tempering, and that it is
computationally efficient.
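The recursive update can be sketched on a plain logistic GLM, where the per-observation gradient and Hessian are available in closed form (unlike GLMMs, where the paper estimates them by importance sampling via Fisher's and Louis' identities). A Gaussian approximation N(mu, Lambda^-1) is refreshed once per observation in a single pass; the data-generating parameters and prior below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

# Sequential Gaussian-approximation sketch for logistic regression:
# keep a Gaussian posterior N(mu, Lambda^{-1}) and refresh it with the
# gradient and (negative) Hessian of each new observation's log-likelihood.
d = 3
beta_true = np.array([1.0, -2.0, 0.5])  # illustrative "true" coefficients
mu = np.zeros(d)
Lam = np.eye(d)                         # prior precision: beta ~ N(0, I)

for _ in range(2000):                   # one pass through the data stream
    x = rng.normal(size=d)
    y = float(rng.random() < 1 / (1 + np.exp(-x @ beta_true)))
    p = 1 / (1 + np.exp(-x @ mu))
    grad = (y - p) * x                  # gradient of log p(y | x, mu)
    hess = -p * (1 - p) * np.outer(x, x)
    Lam = Lam - hess                    # precision accumulates curvature
    mu = mu + np.linalg.solve(Lam, grad)

print("posterior mean:", np.round(mu, 2))
```

Because each step touches only the newest observation, the state (mu, Lam) can be updated as data arrive, mirroring the single-pass property described above; the instability R-VGAL tempers in its first steps would show up here as large early jumps in mu while Lam is still close to the prior.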