581,205 research outputs found

    Optimal algorithms for selecting top-k combinations of attributes : theory and applications

    Get PDF
    Traditional top-k algorithms, e.g., TA and NRA, have been successfully applied in many areas such as information retrieval, data mining and databases. They are designed to discover k objects, e.g., top-k restaurants, with highest overall scores aggregated from different attributes, e.g., price and location. However, new emerging applications like query recommendation require providing the best combinations of attributes, instead of objects. The straightforward extension based on the existing top-k algorithms is prohibitively expensive to answer top-k combinations because they need to enumerate all the possible combinations, which is exponential to the number of attributes. In this article, we formalize a novel type of top-k query, called top-k, m, which aims to find top-k combinations of attributes based on the overall scores of the top-m objects within each combination, where m is the number of objects forming a combination. We propose a family of efficient top-k, m algorithms with different data access methods, i.e., sorted accesses and random accesses and different query certainties, i.e., exact query processing and approximate query processing. Theoretically, we prove that our algorithms are instance optimal and analyze the bound of the depth of accesses. We further develop optimizations for efficient query evaluation to reduce the computational and the memory costs and the number of accesses. We provide a case study on the real applications of top-k, m queries for an online biomedical search engine. Finally, we perform comprehensive experiments to demonstrate the scalability and efficiency of top-k, m algorithms on multiple real-life datasets.Peer reviewe

    Monte Carlo Methods for Top-k Personalized PageRank Lists and Name Disambiguation

    Get PDF
    We study a problem of quick detection of top-k Personalized PageRank lists. This problem has a number of important applications such as finding local cuts in large graphs, estimation of similarity distance and name disambiguation. In particular, we apply our results to construct efficient algorithms for the person name disambiguation problem. We argue that when finding top-k Personalized PageRank lists two observations are important. Firstly, it is crucial that we detect fast the top-k most important neighbours of a node, while the exact order in the top-k list as well as the exact values of PageRank are by far not so crucial. Secondly, a little number of wrong elements in top-k lists do not really degrade the quality of top-k lists, but it can lead to significant computational saving. Based on these two key observations we propose Monte Carlo methods for fast detection of top-k Personalized PageRank lists. We provide performance evaluation of the proposed methods and supply stopping criteria. Then, we apply the methods to the person name disambiguation problem. The developed algorithm for the person name disambiguation problem has achieved the second place in the WePS 2010 competition

    Approximating the Largest Root and Applications to Interlacing Families

    Full text link
    We study the problem of approximating the largest root of a real-rooted polynomial of degree nn using its top kk coefficients and give nearly matching upper and lower bounds. We present algorithms with running time polynomial in kk that use the top kk coefficients to approximate the maximum root within a factor of n1/kn^{1/k} and 1+O(log⁥nk)21+O(\tfrac{\log n}{k})^2 when k≀log⁥nk\leq \log n and k>log⁥nk>\log n respectively. We also prove corresponding information-theoretic lower bounds of nΩ(1/k)n^{\Omega(1/k)} and 1+Ω(log⁥2nkk)21+\Omega\left(\frac{\log \frac{2n}{k}}{k}\right)^2, and show strong lower bounds for noisy version of the problem in which one is given access to approximate coefficients. This problem has applications in the context of the method of interlacing families of polynomials, which was used for proving the existence of Ramanujan graphs of all degrees, the solution of the Kadison-Singer problem, and bounding the integrality gap of the asymmetric traveling salesman problem. All of these involve computing the maximum root of certain real-rooted polynomials for which the top few coefficients are accessible in subexponential time. Our results yield an algorithm with the running time of 2O~(n3)2^{\tilde O(\sqrt[3]n)} for all of them

    OPTIMAL CONSTRUCTION OF A LAYER-ORDERED HEAP AND ITS APPLICATIONS

    Get PDF
    The layer-ordered heap (LOH) is a simple data structure used in algorithms that perform optimal top-kk on X+YX+Y, algorithms with the best known runtime for top-kk on X1+X2+⋯+XmX_1+X_2+\cdots+X_m, and the fastest method in practice for computing the most abundant isotopologue peaks in a chemical compound. In the analysis of these algorithms, the rank, α\alpha, has been treated as a constant and nn, the size of the array, has been treated as the sole parameter. Here, we explore the algorithmic complexity of LOH construction with α\alpha as a parameter, introduce a few algorithms for constructing LOHs, analyze their complexity in both nn and α\alpha, and demonstrate that one algorithm is optimal in both nn and α\alpha for building a LOH of any rank. We then apply this to improve performance in applications where they are employed, find an estimate for the optimal α\alpha given an nn and kk for top-kk on X+YX+Y, and derive a novel algorithm for top-kk on a multinomial distribution. Finally, we show that the results of our LOH analysis correspond with empirical experiments of runtimes when applying the LOH construction algorithms to both a common task in machine learning and top-kk on X1+X2+⋯+XmX_1+X_2+\cdots+X_m and that our estimate of the optimal α\alpha for top-kk on X+YX+Y corresponds well with empirical data

    Surrogate Functions for Maximizing Precision at the Top

    Full text link
    The problem of maximizing precision at the top of a ranked list, often dubbed Precision@k (prec@k), finds relevance in myriad learning applications such as ranking, multi-label classification, and learning with severe label imbalance. However, despite its popularity, there exist significant gaps in our understanding of this problem and its associated performance measure. The most notable of these is the lack of a convex upper bounding surrogate for prec@k. We also lack scalable perceptron and stochastic gradient descent algorithms for optimizing this performance measure. In this paper we make key contributions in these directions. At the heart of our results is a family of truly upper bounding surrogates for prec@k. These surrogates are motivated in a principled manner and enjoy attractive properties such as consistency to prec@k under various natural margin/noise conditions. These surrogates are then used to design a class of novel perceptron algorithms for optimizing prec@k with provable mistake bounds. We also devise scalable stochastic gradient descent style methods for this problem with provable convergence bounds. Our proofs rely on novel uniform convergence bounds which require an in-depth analysis of the structural properties of prec@k and its surrogates. We conclude with experimental results comparing our algorithms with state-of-the-art cutting plane and stochastic gradient algorithms for maximizing [email protected]: To appear in the the proceedings of the 32nd International Conference on Machine Learning (ICML 2015

    Distributed top-k aggregation queries at large

    Get PDF
    Top-k query processing is a fundamental building block for efficient ranking in a large number of applications. Efficiency is a central issue, especially for distributed settings, when the data is spread across different nodes in a network. This paper introduces novel optimization methods for top-k aggregation queries in such distributed environments. The optimizations can be applied to all algorithms that fall into the frameworks of the prior TPUT and KLEE methods. The optimizations address three degrees of freedom: 1) hierarchically grouping input lists into top-k operator trees and optimizing the tree structure, 2) computing data-adaptive scan depths for different input sources, and 3) data-adaptive sampling of a small subset of input sources in scenarios with hundreds or thousands of query-relevant network nodes. All optimizations are based on a statistical cost model that utilizes local synopses, e.g., in the form of histograms, efficiently computed convolutions, and estimators based on order statistics. The paper presents comprehensive experiments, with three different real-life datasets and using the ns-2 network simulator for a packet-level simulation of a large Internet-style network

    Quasi-Convex Scoring Functions in Branch-and-Bound Ranked Search

    Get PDF
    For answering top-k queries in which attributes are aggregated to a scalar value for defining a ranking, usually the well-known branch-and-bound principle can be used for efficient query answering. Standard algorithms (e.g., Branch-and-Bound Ranked Search, BRS for short) require scoring functions to be monotone, such that a top-k ranking can be computed in sublinear time in the average case. If monotonicity cannot be guaranteed, efficient query answering algorithms are not known. To make branch-and-bound effective with descending or ascending rankings (maximum top-k or minimum top-k queries, respectively), BRS must be able to identify bounds for exploring search partitions, and only for monotonic ranking functions this is trivial. In this paper, we investigate the class of quasi-convex functions used for scoring objects, and we examine how bounds for exploring data partitions can correctly and efficiently be computed for quasi-convex functions in BRS for maximum top-k queries. Given that quasi-convex scoring functions can usefully be employed for ranking objects in a variety of applications, the mathematical findings presented in this paper are indeed significant for practical top-k query answering
    • 

    corecore