
    Personalised Search Time Prediction using Markov Chains

    To improve the effectiveness of Interactive Information Retrieval (IIR), a system should minimise search time by guiding the user appropriately. As a prerequisite, in any search situation, the system must be able to estimate the time the user will need to find the next relevant document. In this paper, we show how Markov models derived from search logs can be used for predicting search times, and describe a method for evaluating these predictions. To personalise the predictions based on only a few observed user events, we devise appropriate parameter estimation methods. Our experimental results show that after observing a user for only 100 seconds, the personalised predictions are already significantly better than global predictions.
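    To make the prediction target concrete, the sketch below computes the expected time to absorption in a small discrete-time Markov chain. It is a minimal illustration, not the paper's method: the three transient search states and all transition probabilities are invented, and in practice they would be estimated from search logs.

    import numpy as np

    # Hypothetical transient states: 0 = issue query, 1 = inspect snippet,
    # 2 = read document; the absorbing state is "relevant document found".
    # All transition probabilities below are invented for illustration.
    Q = np.array([
        [0.2, 0.6, 0.1],
        [0.3, 0.1, 0.5],
        [0.1, 0.2, 0.2],
    ])  # each row sums to < 1; the remaining mass goes to the absorbing state

    # Fundamental matrix N = (I - Q)^{-1}; N @ 1 gives the expected number
    # of transient steps before absorption, starting from each state.
    N = np.linalg.inv(np.eye(3) - Q)
    expected_steps = N @ np.ones(3)
    print(expected_steps)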

    Maximum Coverage in the Data Stream Model: Parameterized and Generalized

    We present algorithms for the Max-Cover and Max-Unique-Cover problems in the data stream model. The input to both problems is $m$ subsets of a universe of size $n$ and a value $k \in [m]$. In Max-Cover, the problem is to find a collection of at most $k$ sets such that the number of elements covered by at least one set is maximized. In Max-Unique-Cover, the problem is to find a collection of at most $k$ sets such that the number of elements covered by exactly one set is maximized. Our goal is to design single-pass algorithms that use space sublinear in the input size. Our main algorithmic results are: If the sets have size at most $d$, there exist single-pass algorithms using $\tilde{O}(d^{d+1} k^d)$ space that solve both problems exactly; this is optimal up to polylogarithmic factors for constant $d$. If each element appears in at most $r$ sets, we present single-pass algorithms using $\tilde{O}(k^2 r/\epsilon^3)$ space that return a $1+\epsilon$ approximation in the case of Max-Cover. We also present a single-pass algorithm using slightly more memory, i.e., $\tilde{O}(k^3 r/\epsilon^4)$ space, that $1+\epsilon$-approximates Max-Unique-Cover. In contrast to the above results, when $d$ and $r$ are arbitrary, any constant-pass $1+\epsilon$ approximation algorithm for either problem requires $\Omega(\epsilon^{-2} m)$ space, but a single-pass $O(\epsilon^{-2} m k)$ space algorithm exists. In fact, any constant-pass algorithm with an approximation better than $e/(e-1)$ for Max-Cover and $e^{1-1/k}$ for Max-Unique-Cover requires $\Omega(m/k^2)$ space when $d$ and $r$ are unrestricted. En route, we also obtain an algorithm for a parameterized version of the streaming Set-Cover problem.
    Comment: Conference version to appear at ICDT 202
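    As a point of reference for the two objectives, the following brute-force sketch evaluates coverage and unique coverage over all size-$k$ collections of in-memory sets. It illustrates only the problem definitions; the paper's contribution is solving them in a single pass with sublinear space, which this sketch does not attempt.

    from itertools import combinations
    from collections import Counter

    def coverage(combo):
        """Elements covered by at least one chosen set (Max-Cover objective)."""
        return len(set().union(*combo))

    def unique_coverage(combo):
        """Elements covered by exactly one chosen set (Max-Unique-Cover objective)."""
        counts = Counter(e for s in combo for e in s)
        return sum(1 for c in counts.values() if c == 1)

    def best_collection(sets, k, objective):
        """Brute force: try every collection of k sets, keep the best one."""
        return max(combinations(sets, k), key=objective)

    sets = [{1, 2, 3}, {3, 4}, {4, 5, 6}, {1, 6}]
    print(best_collection(sets, 2, coverage))         # ({1, 2, 3}, {4, 5, 6}): 6 covered
    print(best_collection(sets, 2, unique_coverage))  # same pair: all 6 covered exactly once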

    Model-based clustering of large networks

    We describe a network clustering framework, based on finite mixture models, that can be applied to discrete-valued networks with hundreds of thousands of nodes and billions of edge variables. Relative to other recent model-based clustering work for networks, we introduce a more flexible modeling framework, improve the variational-approximation estimation algorithm, discuss and implement standard error estimation via a parametric bootstrap approach, and apply these methods to much larger data sets than those seen elsewhere in the literature. The more flexible framework is achieved by introducing novel parameterizations of the model, giving varying degrees of parsimony, using exponential-family models whose structure may be exploited in various theoretical and algorithmic ways. The algorithms are based on variational generalized EM algorithms, where the E-steps are augmented by a minorization-maximization (MM) idea. The bootstrapped standard error estimates are based on an efficient Monte Carlo network simulation idea. Finally, we demonstrate the usefulness of the model-based clustering framework by applying it to a discrete-valued network with more than 131,000 nodes and 17 billion edge variables.
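    A heavily simplified sketch of the variational EM idea appears below for a plain Bernoulli stochastic block model; the paper's framework covers richer exponential-family edge models, MM-augmented E-steps, and bootstrap standard errors, none of which appear here. The model sizes and probabilities in the demo are invented.

    import numpy as np

    rng = np.random.default_rng(0)

    def variational_em_sbm(A, K, iters=50):
        """Variational EM for a Bernoulli stochastic block model.
        A: symmetric 0/1 adjacency matrix with zero diagonal."""
        n = A.shape[0]
        tau = rng.dirichlet(np.ones(K), size=n)   # variational membership probabilities
        for _ in range(iters):
            # M-step: mixing proportions and block edge probabilities
            pi = tau.mean(axis=0)
            num = tau.T @ A @ tau
            den = tau.T @ (1 - np.eye(n)) @ tau
            theta = np.clip(num / den, 1e-6, 1 - 1e-6)
            # E-step: one fixed-point update of the membership probabilities
            logp = (np.log(pi)
                    + A @ tau @ np.log(theta).T
                    + (1 - A - np.eye(n)) @ tau @ np.log(1 - theta).T)
            tau = np.exp(logp - logp.max(axis=1, keepdims=True))
            tau /= tau.sum(axis=1, keepdims=True)
        return tau, pi, theta

    # Tiny demo: recover two planted blocks in a 20-node network.
    n, K = 20, 2
    z = np.repeat([0, 1], n // 2)
    P = np.array([[0.8, 0.1], [0.1, 0.7]])
    A = (rng.random((n, n)) < P[z][:, z]).astype(float)
    A = np.triu(A, 1); A = A + A.T
    tau, pi, theta = variational_em_sbm(A, K)
    print(tau.argmax(axis=1))   # estimated block memberships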

    R-VGAL: A Sequential Variational Bayes Algorithm for Generalised Linear Mixed Models

    Models with random effects, such as generalised linear mixed models (GLMMs), are often used for analysing clustered data. Parameter inference with these models is difficult because of the presence of cluster-specific random effects, which must be integrated out when evaluating the likelihood function. Here, we propose a sequential variational Bayes algorithm, called Recursive Variational Gaussian Approximation for Latent variable models (R-VGAL), for estimating parameters in GLMMs. The R-VGAL algorithm operates on the data sequentially, requires only a single pass through the data, and can provide parameter updates as new data are collected without having to reprocess the previous data. At each update, the R-VGAL algorithm requires the gradient and Hessian of a "partial" log-likelihood function evaluated at the new observation, which are generally not available in closed form for GLMMs. To circumvent this issue, we propose an importance-sampling-based approach for estimating the gradient and Hessian via Fisher's and Louis' identities. We find that R-VGAL can be unstable when traversing the first few data points, but that this issue can be mitigated by using a variant of variational tempering in the initial steps of the algorithm. Through illustrations on both simulated and real datasets, we show that R-VGAL provides good approximations to the exact posterior distributions, that it can be made robust through tempering, and that it is computationally efficient.
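    The flavour of the recursion can be sketched for plain logistic regression, where the per-observation gradient and Hessian are available in closed form (unlike the GLMM setting, which needs the importance-sampling estimates described above). This is a simplified caricature rather than the R-VGAL algorithm itself: each observation subtracts the log-likelihood Hessian from the variational precision and takes a mean step, with derivatives plugged in at the current mean rather than averaged over the variational distribution.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def recursive_vga_logistic(X, y, prior_var=10.0):
        """One-pass recursive Gaussian variational approximation (sketch)."""
        d = X.shape[1]
        mu = np.zeros(d)              # variational mean
        P = np.eye(d) / prior_var     # variational precision
        for x_i, y_i in zip(X, y):
            p = sigmoid(x_i @ mu)
            grad = (y_i - p) * x_i                      # gradient of log p(y_i | mu)
            hess = -p * (1 - p) * np.outer(x_i, x_i)    # Hessian of log p(y_i | mu)
            P = P - hess                                # curvature accumulates in the precision
            mu = mu + np.linalg.solve(P, grad)          # mean step using the new precision
        return mu, np.linalg.inv(P)

    # Demo on simulated data: one pass lands near the true coefficients.
    rng = np.random.default_rng(1)
    X = rng.normal(size=(500, 3))
    beta = np.array([1.0, -2.0, 0.5])
    y = (rng.random(500) < sigmoid(X @ beta)).astype(float)
    mu, Sigma = recursive_vga_logistic(X, y)
    print(mu)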