Personalised Search Time Prediction using Markov Chains
To improve the effectiveness of Interactive Information Retrieval (IIR), a system should minimise the search time by guiding the user appropriately. As a prerequisite, in any search situation, the system must be able to estimate the time the user will need to find the next relevant document. In this paper, we show how Markov models derived from search logs can be used for predicting search times, and we describe a method for evaluating these predictions. To personalise the predictions based on a few observed user events, we devise appropriate parameter-estimation methods. Our experimental results show that after observing a user for only 100 seconds, the personalised predictions are already significantly better than global predictions.
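The expected-search-time idea can be sketched with a small absorbing Markov chain: given transition probabilities between session states (which the paper estimates from search logs) and mean dwell times, the expected time until the user reaches a relevant document solves a linear system. The states, probabilities, and dwell times below are hypothetical illustrations, not values from the paper:

```python
import numpy as np

# Hypothetical search-session states; "found" is absorbing.
# (Transition probabilities would be estimated from search logs.)
states = ["query", "result_list", "document_view", "found"]
P = np.array([
    [0.10, 0.80, 0.05, 0.05],  # query
    [0.20, 0.30, 0.45, 0.05],  # result_list
    [0.15, 0.35, 0.20, 0.30],  # document_view
    [0.00, 0.00, 0.00, 1.00],  # found (absorbing)
])
dwell = np.array([8.0, 12.0, 25.0])  # mean seconds spent per visit to each transient state

# Expected time to absorption: t = (I - Q)^{-1} dwell,
# where Q is the transient-to-transient block of P.
Q = P[:3, :3]
t = np.linalg.solve(np.eye(3) - Q, dwell)
for s, seconds in zip(states[:3], t):
    print(f"expected search time from {s!r}: {seconds:.1f}s")
```

Personalisation then amounts to re-estimating the entries of P (and the dwell times) from the few events observed for a particular user.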
Maximum Coverage in the Data Stream Model: Parameterized and Generalized
We present algorithms for the Max-Cover and Max-Unique-Cover problems in the data stream model. The input to both problems is a collection of m subsets of a universe of size n and a value k. In Max-Cover, the problem is to find a collection of at most k sets such that the number of elements covered by at least one set is maximized. In Max-Unique-Cover, the problem is to find a collection of at most k sets such that the number of elements covered by exactly one set is maximized. Our goal is to design single-pass algorithms that use space that is sublinear in the input size. Our main algorithmic results are:
If the sets have size at most d, there exist single-pass algorithms, using space that depends only on d and k up to polylogarithmic factors, that solve both problems exactly. This is optimal up to polylogarithmic factors for constant d.
If each element appears in at most r sets, we present single-pass algorithms, using space polynomial in k, r, and 1/ε, that return a 1-ε approximation in the case of Max-Cover. We also present a single-pass algorithm, using slightly more memory, that approximates Max-Unique-Cover.
In contrast to the above results, when d and r are arbitrary, any constant-pass algorithm achieving a near-optimal approximation for either problem requires space nearly linear in m, but a single-pass algorithm using space nearly linear in m exists. In fact, any constant-pass algorithm whose approximation factor beats certain fixed constants for Max-Cover and Max-Unique-Cover respectively requires space that grows polynomially with m when d and r are unrestricted.
En route, we also obtain an algorithm for a parameterized version of the streaming Set-Cover problem.
Comment: Conference version to appear at ICDT 202
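As a rough illustration of the streaming setting, here is a simple single-pass swap heuristic for Max-Cover: it keeps at most k sets and swaps one out whenever the swap strictly increases coverage. This is only a baseline sketch, not the paper's space-efficient algorithms, and it carries no approximation guarantee:

```python
def stream_max_cover(sets_stream, k):
    """Single-pass heuristic for Max-Cover: keep at most k sets,
    swapping one out whenever that strictly increases total coverage.
    (A simple baseline, not the paper's algorithms.)"""
    kept = []

    def coverage(collection):
        return len(set().union(*collection)) if collection else 0

    for s in sets_stream:
        s = frozenset(s)
        if len(kept) < k:
            kept.append(s)
            continue
        base = coverage(kept)
        best_gain, best_i = 0, None
        for i in range(k):
            # Coverage if we replace kept[i] with the arriving set s.
            gain = coverage(kept[:i] + kept[i + 1:] + [s]) - base
            if gain > best_gain:
                best_gain, best_i = gain, i
        if best_i is not None:
            kept[best_i] = s
    return kept

sets = [{1, 2}, {2, 3}, {4, 5, 6}, {1, 2, 3}, {7}]
chosen = stream_max_cover(iter(sets), k=2)
print(sorted(map(sorted, chosen)), len(set().union(*chosen)))
```

Note that the heuristic stores only the k candidate sets themselves, so its memory is independent of the stream length, though not of the set sizes.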
Model-based clustering of large networks
We describe a network clustering framework, based on finite mixture models, that can be applied to discrete-valued networks with hundreds of thousands of nodes and billions of edge variables. Relative to other recent model-based clustering work for networks, we introduce a more flexible modeling framework, improve the variational-approximation estimation algorithm, discuss and implement standard error estimation via a parametric bootstrap approach, and apply these methods to much larger data sets than those seen elsewhere in the literature. The more flexible framework is achieved by introducing novel parameterizations of the model, giving varying degrees of parsimony, using exponential-family models whose structure may be exploited in various theoretical and algorithmic ways. The algorithms are based on variational generalized EM algorithms, where the E-steps are augmented by a minorization-maximization (MM) idea. The bootstrapped standard error estimates are based on an efficient Monte Carlo network simulation idea. Lastly, we demonstrate the usefulness of the model-based clustering framework by applying it to a discrete-valued network with more than 131,000 nodes and 17 billion edge variables.
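The variational-EM idea can be sketched on a toy stochastic block model, a standard special case of such finite mixture models for networks: the E-step updates soft block memberships, and the M-step re-estimates block proportions and edge probabilities. The model, data, and update schedule below are illustrative assumptions, far simpler than the paper's exponential-family framework and MM-augmented E-steps:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 2-block network: dense within blocks, sparse between
# (a toy stand-in for the large discrete-valued networks in the paper).
n, K = 60, 2
z_true = np.repeat([0, 1], n // 2)
probs = np.where(z_true[:, None] == z_true[None, :], 0.25, 0.02)
A = (rng.random((n, n)) < probs).astype(float)
A = np.triu(A, 1); A = A + A.T  # symmetric adjacency, no self-loops

# Variational EM for a Bernoulli stochastic block model.
tau = rng.dirichlet(np.ones(K), size=n)  # soft block memberships
for _ in range(50):
    # M-step: block proportions and block-pair edge probabilities.
    pi = tau.mean(axis=0)
    w = tau.T @ A @ tau                                           # expected edges between blocks
    m = tau.sum(0)[:, None] * tau.sum(0)[None, :] - tau.T @ tau   # expected pairs, i != j
    theta = np.clip(w / np.maximum(m, 1e-9), 1e-6, 1 - 1e-6)
    # E-step (one mean-field sweep): update tau given pi and theta.
    logit = A @ tau @ np.log(theta).T \
        + (1 - A - np.eye(n)) @ tau @ np.log(1 - theta).T
    logtau = np.log(pi) + logit
    logtau -= logtau.max(axis=1, keepdims=True)
    tau = np.exp(logtau)
    tau /= tau.sum(axis=1, keepdims=True)

z_hat = tau.argmax(axis=1)
# Account for label switching when comparing to the truth.
agreement = max(np.mean(z_hat == z_true), np.mean(z_hat != z_true))
print(f"recovered block agreement: {agreement:.2f}")
```

A parametric bootstrap, as in the paper, would refit this procedure to networks simulated from the fitted (pi, theta) to obtain standard errors.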
R-VGAL: A Sequential Variational Bayes Algorithm for Generalised Linear Mixed Models
Models with random effects, such as generalised linear mixed models (GLMMs),
are often used for analysing clustered data. Parameter inference with these
models is difficult because of the presence of cluster-specific random effects,
which must be integrated out when evaluating the likelihood function. Here, we
propose a sequential variational Bayes algorithm, called Recursive Variational
Gaussian Approximation for Latent variable models (R-VGAL), for estimating
parameters in GLMMs. The R-VGAL algorithm operates on the data sequentially,
requires only a single pass through the data, and can provide parameter updates
as new data are collected, without the need to re-process previous data.
At each update, the R-VGAL algorithm requires the gradient and Hessian of a
"partial" log-likelihood function evaluated at the new observation, which are
generally not available in closed form for GLMMs. To circumvent this issue, we
propose using an importance-sampling-based approach for estimating the gradient
and Hessian via Fisher's and Louis' identities. We find that R-VGAL can be
unstable when traversing the first few data points, but that this issue can be
mitigated by using a variant of variational tempering in the initial steps of
the algorithm. Through illustrations on both simulated and real datasets, we
show that R-VGAL provides good approximations to the exact posterior
distributions, that it can be made robust through tempering, and that it is
computationally efficient.
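The recursive update can be sketched on a plain logistic GLM, where the per-observation gradient and Hessian are available in closed form (unlike GLMMs, where the paper estimates them by importance sampling via Fisher's and Louis' identities). A Gaussian approximation N(mu, Lambda^-1) is refreshed once per observation in a single pass; the data-generating parameters and prior below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

# Sequential Gaussian-approximation sketch for logistic regression:
# keep a Gaussian posterior N(mu, Lambda^{-1}) and refresh it with the
# gradient and (negative) Hessian of each new observation's log-likelihood.
d = 3
beta_true = np.array([1.0, -2.0, 0.5])  # illustrative "true" coefficients
mu = np.zeros(d)
Lam = np.eye(d)                         # prior precision: beta ~ N(0, I)

for _ in range(2000):                   # one pass through the data stream
    x = rng.normal(size=d)
    y = float(rng.random() < 1 / (1 + np.exp(-x @ beta_true)))
    p = 1 / (1 + np.exp(-x @ mu))
    grad = (y - p) * x                  # gradient of log p(y | x, mu)
    hess = -p * (1 - p) * np.outer(x, x)
    Lam = Lam - hess                    # precision accumulates curvature
    mu = mu + np.linalg.solve(Lam, grad)

print("posterior mean:", np.round(mu, 2))
```

Because each step touches only the newest observation, the state (mu, Lam) can be updated as data arrive, mirroring the single-pass property described above; the instability R-VGAL tempers in its first steps would show up here as large early jumps in mu while Lam is still close to the prior.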