
    Efficient Numerical Methods to Solve Sparse Linear Equations with Application to PageRank

    In this paper, we propose three methods to solve the PageRank problem for transition matrices with both row and column sparsity. Our methods reduce the PageRank problem to a convex optimization problem over the simplex. The first algorithm is based on gradient descent in the L1 norm instead of the Euclidean one. The second algorithm extends the Frank-Wolfe method to support sparse gradient updates. The third algorithm is a mirror descent algorithm with a randomized projection. We prove convergence rates for these methods on sparse problems, and numerical experiments support their effectiveness. Comment: 26 pages
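
    A minimal dense sketch of the simplex-optimization viewpoint described above (not the authors' sparse implementation): PageRank is posed as minimizing f(x) = 0.5 * ||x - Gx||^2 over the probability simplex and solved with a plain Frank-Wolfe loop. The toy matrix, damping factor, and step-size rule are illustrative assumptions.

        import numpy as np

        def pagerank_frank_wolfe(P, alpha=0.85, iters=2000):
            """P: column-stochastic link matrix; returns an approximate PageRank vector."""
            n = P.shape[0]
            v = np.full(n, 1.0 / n)                    # teleportation distribution
            G = alpha * P + (1 - alpha) * np.outer(v, np.ones(n))  # Google matrix
            A = np.eye(n) - G                          # residual operator, f(x) = 0.5*||A x||^2
            x = np.full(n, 1.0 / n)                    # start at the simplex barycenter
            for k in range(iters):
                grad = A.T @ (A @ x)                   # gradient of f at the current iterate
                i = int(np.argmin(grad))               # linear minimization over the simplex
                gamma = 2.0 / (k + 2)                  # standard Frank-Wolfe step size
                e = np.zeros(n)
                e[i] = 1.0
                x = (1 - gamma) * x + gamma * e        # convex step toward the vertex e_i
            return x

        # Tiny 3-page cycle 0 -> 1 -> 2 -> 0; the exact PageRank here is uniform.
        P = np.array([[0.0, 0.0, 1.0],
                      [1.0, 0.0, 0.0],
                      [0.0, 1.0, 0.0]])
        print(pagerank_frank_wolfe(P))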

    Ergodic Control and Polyhedral approaches to PageRank Optimization

    We study a general class of PageRank optimization problems which consist in finding an optimal outlink strategy for a web site subject to design constraints. We consider both a continuous problem, in which one can choose the intensity of a link, and a discrete one, in which, on each page, there are obligatory links, facultative links and forbidden links. We show that the continuous problem, as well as its discrete variant when there are no constraints coupling different pages, can both be modeled by constrained Markov decision processes with ergodic reward, in which the webmaster determines the transition probabilities of websurfers. Although the number of actions turns out to be exponential, we show that an associated polytope of transition measures has a concise representation, from which we deduce that the continuous problem is solvable in polynomial time, and that the same is true for the discrete problem when there are no coupling constraints. We also provide efficient algorithms, adapted to very large networks. Then, we investigate the qualitative features of optimal outlink strategies, and identify in particular assumptions under which there exists a "master" page to which all controlled pages should point. We report numerical results on fragments of the real web graph. Comment: 39 pages
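
    For intuition, a brute-force sketch (not the paper's polynomial-time polytope or MDP machinery): candidate outlink distributions for one controlled page are compared by recomputing PageRank with power iteration and keeping the choice that maximizes that page's score. The tiny graph and the candidate strategies are illustrative assumptions.

        import numpy as np

        def pagerank(P, alpha=0.85, iters=200):
            """Power iteration for a column-stochastic transition matrix P."""
            n = P.shape[0]
            x = np.full(n, 1.0 / n)
            for _ in range(iters):
                x = alpha * (P @ x) + (1 - alpha) / n
            return x

        # Columns are transition distributions; pages 1..3 have fixed outlinks,
        # page 0 is the controlled page whose column we vary below.
        base = np.array([[0.0, 0.5, 0.0, 1.0],
                         [0.0, 0.0, 0.5, 0.0],
                         [0.0, 0.5, 0.0, 0.0],
                         [0.0, 0.0, 0.5, 0.0]])
        candidates = {"link to page 1 only":   [0.0, 1.0, 0.0, 0.0],
                      "split between 1 and 3": [0.0, 0.5, 0.0, 0.5]}
        for name, col in candidates.items():
            P = base.copy()
            P[:, 0] = col
            print(name, "-> PageRank of page 0:", round(pagerank(P)[0], 4))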

    A Web Aggregation Approach for Distributed Randomized PageRank Algorithms

    The PageRank algorithm employed at Google assigns a measure of importance to each web page for rankings in search results. In our recent papers, we have proposed a distributed randomized approach for this algorithm, where web pages are treated as agents computing their own PageRank by communicating with linked pages. This paper builds upon this approach to reduce the computation and communication loads for the algorithms. In particular, we develop a method to systematically aggregate the web pages into groups by exploiting the sparsity inherent in the web. For each group, an aggregated PageRank value is computed, which can then be distributed among the group members. We provide a distributed update scheme for the aggregated PageRank along with an analysis of its convergence properties. The method is especially motivated by results on singular perturbation techniques for large-scale Markov chains and multi-agent consensus. Comment: To appear in the IEEE Transactions on Automatic Control, 201
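
    A rough sketch of the aggregate-then-distribute idea (not the authors' singular-perturbation-based update scheme): pages are grouped, a group-level PageRank is computed on a lumped transition matrix, and each group's value is then spread uniformly over its members. The grouping and the lumping rule below are illustrative assumptions.

        import numpy as np

        def pagerank(P, alpha=0.85, iters=200):
            """Damped power iteration for a column-stochastic matrix P."""
            n = P.shape[0]
            x = np.full(n, 1.0 / n)
            for _ in range(iters):
                x = alpha * (P @ x) + (1 - alpha) / n
            return x

        def aggregated_pagerank(P, groups):
            """P: page-level column-stochastic matrix; groups: list of page-index lists."""
            g = len(groups)
            Q = np.zeros((g, g))
            # Lump P: average the probability mass flowing from each source group
            # to each destination group; Q stays column-stochastic.
            for b, src in enumerate(groups):
                for a, dst in enumerate(groups):
                    Q[a, b] = P[np.ix_(dst, src)].sum() / len(src)
            group_rank = pagerank(Q)
            # Crude distribution step: spread each group's rank uniformly over its members.
            x = np.zeros(P.shape[0])
            for a, members in enumerate(groups):
                x[members] = group_rank[a] / len(members)
            return x

        P = np.array([[0.0, 0.5, 0.0, 0.0],
                      [1.0, 0.0, 0.5, 0.0],
                      [0.0, 0.5, 0.0, 1.0],
                      [0.0, 0.0, 0.5, 0.0]])
        print(aggregated_pagerank(P, groups=[[0, 1], [2, 3]]))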

    PageRank optimization applied to spam detection

    We give a new link spam detection and PageRank demotion algorithm called MaxRank. Like TrustRank and AntiTrustRank, it starts with a seed of hand-picked trusted and spam pages. We define the MaxRank of a page as the frequency of visit of this page by a random surfer minimizing an average cost per time unit. On a given page, the random surfer selects a set of hyperlinks and clicks with uniform probability on any of these hyperlinks. The cost function penalizes spam pages and hyperlink removals. The goal is to determine a hyperlink deletion policy that minimizes this score. The MaxRank is interpreted as a modified PageRank vector, used to sort web pages instead of the usual PageRank vector. The bias vector of this ergodic control problem, which is unique up to an additive constant, is a measure of the "spamicity" of each page, used to detect spam pages. We give a scalable algorithm for MaxRank computation that allowed us to run experiments on the WEBSPAM-UK2007 dataset. We show that our algorithm outperforms both TrustRank and AntiTrustRank for spam and nonspam page detection. Comment: 8 pages, 6 figures
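
    A brute-force illustration of the control problem (not the paper's scalable ergodic-control algorithm): hyperlink-deletion choices on a single controlled page are enumerated, the stationary visit frequencies are computed by damped power iteration, and each choice is scored by the frequency of spam visits plus a penalty per removed link. The graph, spam labels, and penalty weight are illustrative assumptions.

        import itertools
        import numpy as np

        def visit_freq(outlinks, n, alpha=0.85, iters=200):
            """Stationary distribution of a surfer clicking uniformly among kept outlinks."""
            P = np.zeros((n, n))
            for j, targets in outlinks.items():
                for i in targets:
                    P[i, j] = 1.0 / len(targets)
            x = np.full(n, 1.0 / n)
            for _ in range(iters):
                x = alpha * (P @ x) + (1 - alpha) / n
            return x

        n = 4
        spam = [3]                                   # hand-picked spam page
        outlinks = {0: [1, 2, 3], 1: [0, 2], 2: [0, 3], 3: [0]}
        removal_penalty = 0.05                       # cost per deleted hyperlink
        best = None
        # Enumerate every nonempty subset of page 0's outlinks (page 0 is controlled).
        for r in range(1, len(outlinks[0]) + 1):
            for kept in itertools.combinations(outlinks[0], r):
                trial = {**outlinks, 0: list(kept)}
                freq = visit_freq(trial, n)
                cost = freq[spam].sum() + removal_penalty * (len(outlinks[0]) - len(kept))
                if best is None or cost < best[0]:
                    best = (cost, kept)
        print("best kept outlinks of page 0:", best[1], "average cost:", round(best[0], 4))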

    Multi-Scale Matrix Sampling and Sublinear-Time PageRank Computation

    A fundamental problem arising in many applications in Web science and social network analysis is, given an arbitrary approximation factor c > 1, to output a set S of nodes that with high probability contains all nodes of PageRank at least Δ and no node of PageRank smaller than Δ/c. We call this problem SignificantPageRanks. We develop a nearly optimal, local algorithm for the problem with runtime complexity Õ(n/Δ) on networks with n nodes. We show that any algorithm for solving this problem must have runtime Ω(n/Δ), rendering our algorithm optimal up to logarithmic factors. Our algorithm comes with two main technical contributions. The first is a multi-scale sampling scheme for a basic matrix problem that could be of interest in its own right. In the abstract matrix problem it is assumed that one can access an unknown right-stochastic matrix by querying its rows, where the cost of a query and the accuracy of the answers depend on a precision parameter ε. At a cost proportional to 1/ε, the query returns a list of O(1/ε) entries and their indices that provide an ε-precision approximation of the row. The task is to find a set that contains all columns whose sum is at least Δ and omits every column whose sum is less than Δ/c. Our multi-scale sampling scheme solves this problem with cost Õ(n/Δ), while traditional sampling algorithms would take time Θ((n/Δ)²). Our second main technical contribution is a new local algorithm for approximating personalized PageRank, which is more robust than the earlier ones developed in [JehW03, AndersenCL06] and is highly efficient particularly for networks with large in-degrees or out-degrees. Together with our multi-scale sampling scheme we are able to optimally solve the SignificantPageRanks problem. Comment: Accepted to Internet Mathematics for publication. An extended abstract of this paper appeared in WAW 2012 under the title "A Sublinear Time Algorithm for PageRank Computations"
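
    For context, a sketch of the classic local "push" approximation of personalized PageRank in the style of Andersen-Chung-Lang (the earlier baseline cited above, not the paper's more robust algorithm): only nodes near the seed are ever touched, which is what makes such algorithms local. The graph, teleport probability, and tolerance are illustrative assumptions.

        from collections import defaultdict

        def ppr_push(adj, seed, alpha=0.15, eps=1e-4):
            """adj: dict node -> list of out-neighbors; returns a sparse approximate PPR vector."""
            p = defaultdict(float)          # current approximation of personalized PageRank
            r = defaultdict(float)          # residual probability mass still to be pushed
            r[seed] = 1.0
            queue = [seed]
            while queue:
                u = queue.pop()
                deg = len(adj[u])
                if deg == 0 or r[u] < eps * deg:
                    continue                # residual at u is already small enough
                mass = r[u]
                p[u] += alpha * mass        # keep the teleport fraction at u
                r[u] = 0.0
                for v in adj[u]:            # spread the rest along outgoing edges
                    r[v] += (1 - alpha) * mass / deg
                    if r[v] >= eps * max(len(adj[v]), 1):
                        queue.append(v)
            return dict(p)

        adj = {0: [1, 2], 1: [2], 2: [0], 3: [0]}
        print(ppr_push(adj, seed=0))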