Learning to rank in supervised and unsupervised settings using convexity and monotonicity
This dissertation addresses the task of learning to rank, both in the supervised and unsupervised settings, by exploiting the interplay of convex functions, monotonic mappings, and their fixed points. In the supervised setting of learning to rank, one wishes to learn from examples of correctly ordered items, whereas in the unsupervised setting, one tries to maximize some quantitatively defined characteristic of a "good" ranking. A ranking method selects one permutation from among the combinatorially many permutations defined on the items to rank. Accomplishing this optimally in the supervised setting, with minimal loss in generality, if any, is challenging. In this dissertation this problem is addressed by optimizing, globally and efficiently, a statistically consistent loss functional over the class of compositions of a linear function by an arbitrary, strictly monotonic, separable mapping with large margins. This capability also enables learning the parameters of a generalized linear model with an unknown link function. The method can handle infinite-dimensional feature spaces if the corresponding kernel function is known.

In the unsupervised setting, a popular ranking approach is link analysis over a graph of recommendations, as exemplified by pagerank. This dissertation shows that pagerank may be viewed as an instance of an unsupervised consensus optimization problem. The dissertation then solves a more general problem of unsupervised consensus over noisy, directed recommendation graphs that have uncertainty over the set of "out" edges that emanate from a vertex. The proposed consensus rank is essentially the pagerank over the expected edge set, where the expectation is computed over the distribution that achieves the most agreeable consensus. This consensus is measured geometrically by a suitable Bregman divergence between the consensus rank and the ranks induced by item-specific distributions.

Real-world deployed ranking methods need to be resistant to spam, a particularly sophisticated type of which is link-spam. A popular class of countermeasures "de-spams" the corrupted webgraph by removing abusive pages identified by supervised learning. Since exhaustive detection and neutralization is infeasible, there is a need for ranking functions that can, on the one hand, attenuate the effects of link-spam without supervision and, on the other hand, counter spam more aggressively when supervision is available. A family of non-linear, iteratively defined monotonic functions is proposed that propagates "rank" and "trust" scores through the webgraph. It relies on non-linearity, monotonicity, and Schur-convexity to provide the resistance against spam.
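The consensus-pagerank idea can be illustrated concretely. Below is a minimal sketch, not the dissertation's algorithm: each vertex carries a distribution over candidate outlink sets (here fixed rather than optimized for consensus), the expected transition matrix is formed, and ordinary pagerank is run on it by power iteration. The names `pagerank` and `candidates` are illustrative assumptions.

```python
import numpy as np

def pagerank(P, alpha=0.85, tol=1e-10, max_iter=1000):
    """Power iteration for pagerank on a row-stochastic transition matrix P."""
    n = P.shape[0]
    r = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        r_next = alpha * (r @ P) + (1.0 - alpha) / n
        if np.abs(r_next - r).sum() < tol:
            return r_next
        r = r_next
    return r

# Toy "noisy" graph: each vertex has a distribution over candidate outlink sets.
# The expected transition matrix averages the row-stochastic rows induced by
# each candidate set, weighted by that (assumed, fixed) distribution.
candidates = {
    0: [([1, 2], 0.7), ([2], 0.3)],   # vertex 0: two possible outlink sets
    1: [([2], 1.0)],
    2: [([0], 1.0)],
}
n = 3
P = np.zeros((n, n))
for v, sets in candidates.items():
    for links, prob in sets:
        for u in links:
            P[v, u] += prob / len(links)

print(pagerank(P))  # consensus-style ranks over the expected edge set
```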
Transductive De-Noising and Dimensionality Reduction using Total Bregman Regression
Our goal, on the one hand, is to use labels or other forms of ground-truth data to guide the tasks of de-noising and dimensionality reduction, balancing the objectives of better prediction and better data summarization; on the other hand, it is to explicitly model the noise in the feature values. We generalize the L2 loss, on which PCA and K-means are based, to the Bregman family, which widens the applicability of the proposed algorithms to cases where the data may be constrained to lie on sets of integers or sets of labels rather than R^d as in the PCA or K-means formulations. This makes it possible to handle different prediction tasks, such as classification and regression, in a unified way. Two tasks are formulated: (i) Transductive Total Bregman Regression and (ii) Transductive Bregman PCA.
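To make the generalization concrete, here is a minimal sketch, not taken from the paper, of a Bregman divergence built from a convex generating function: choosing phi(x) = ||x||^2 recovers the squared Euclidean (L2) loss behind PCA and K-means, while a generator such as phi(x) = sum x log x yields the generalized I-divergence, suited to nonnegative (e.g. count-valued) features.

```python
import numpy as np

def bregman(phi, grad_phi, x, y):
    """Bregman divergence D_phi(x, y) = phi(x) - phi(y) - <grad_phi(y), x - y>."""
    return phi(x) - phi(y) - np.dot(grad_phi(y), x - y)

# phi(x) = ||x||^2 gives the squared Euclidean distance underlying PCA/K-means.
sq = lambda x: np.dot(x, x)
sq_grad = lambda x: 2 * x

# phi(x) = sum x log x gives the generalized I-divergence for nonnegative data.
ent = lambda x: np.sum(x * np.log(x))
ent_grad = lambda x: np.log(x) + 1

x = np.array([0.2, 0.5, 0.3])
y = np.array([0.3, 0.4, 0.3])
print(bregman(sq, sq_grad, x, y))    # equals ||x - y||^2
print(bregman(ent, ent_grad, x, y))  # equals sum x*log(x/y) - sum(x - y)
```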
Outlink Estimation for Pagerank Computation under Missing Data
The sheer size and rapid growth of the web-graph force quantities such as its pagerank to be computed under missing information, namely the outlinks of pages that have not yet been crawled. This paper examines the role played by the size and distribution of this missing data in determining the accuracy of the computed pagerank, focusing on questions such as (i) the accuracy of pageranks under missing information, (ii) the size at which a crawl process may be aborted while still ensuring reasonable accuracy of pageranks, and (iii) algorithms to estimate pageranks under such missing information. The first two questions are addressed on the basis of certain simple bounds relating the expected distance between the true and computed pageranks to the size of the missing data. The third question is explored by devising algorithms to predict the pageranks when full information is not available. A key feature of the proposed "dangling link estimation" and "clustered link estimation" algorithms is that they do not need to run the pagerank iteration afresh once the outlinks have been estimated.
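For context, the sketch below implements only the common baseline of spreading the rank mass of dangling pages (pages whose outlinks are unknown or empty) uniformly over all pages; the paper's "dangling link estimation" and "clustered link estimation" algorithms go beyond this, and the function and variable names here are assumptions for illustration.

```python
import numpy as np

def pagerank_with_dangling(adj, alpha=0.85, tol=1e-10, max_iter=1000):
    """Pagerank on an adjacency dict {page: [outlinks]}, spreading the mass
    of dangling pages (empty outlink lists) uniformly over all pages."""
    nodes = sorted(adj)
    idx = {v: i for i, v in enumerate(nodes)}
    n = len(nodes)
    r = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        r_next = np.full(n, (1.0 - alpha) / n)
        for v, outlinks in adj.items():
            if outlinks:
                share = alpha * r[idx[v]] / len(outlinks)
                for u in outlinks:
                    r_next[idx[u]] += share
            else:  # dangling page: no known outlinks
                r_next += alpha * r[idx[v]] / n
        if np.abs(r_next - r).sum() < tol:
            break
        r = r_next
    return dict(zip(nodes, r_next))

print(pagerank_with_dangling({"a": ["b"], "b": ["a", "c"], "c": []}))
```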
Context-Sensitive Modeling of Web-Surfing Behaviour using Concept Trees
Early approaches to mathematically abstracting web-surfing behavior were largely based on first-order Markov models. Most humans, however, do not surf in a "memoryless" fashion; rather, they are guided by their time-dependent situational context and associated information needs. This belief is corroborated by the non-exponential revisit times observed in many site-centric weblogs. In this paper, we propose a general framework for modeling users whose surfing behavior is dynamically governed by their current topic of interest. This allows a modeled surfer to behave differently on the same page, depending on his situational context. The proposed methodology involves mapping each visited page to a topic or concept, (conceptually) imposing a tree hierarchy on these topics, and then estimating the parameters of a semi-Markov process defined on this tree based on the observed transitions among the underlying visited pages. The semi-Markovian assumption imparts additional flexibility by allowing for non-exponential state revisit times, and the concept hierarchy provides a natural way of capturing context and user intent. Our approach is computationally much less demanding than the alternative of using higher-order Markov models to capture history-sensitive surfing behavior. Several practical applications are described. The application of better predicting which outlink a surfer may take is illustrated using web-log data from a rich community portal, www.sulekha.com, as an example, though the focus of the paper is on forming a plausible generative model rather than solving any specific task.
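A minimal sketch of the concept-level estimation step is given below; it assumes a hypothetical `page_to_concept` map and only tallies concept-to-concept transitions and dwell times (in page views) from click sessions, omitting the tree hierarchy over concepts and the fitting of non-exponential revisit-time distributions described in the paper.

```python
from collections import Counter, defaultdict

# Hypothetical page -> concept map; a real system would derive this from the
# concept tree imposed on the site's topics.
page_to_concept = {"p1": "sports", "p2": "sports", "p3": "movies", "p4": "movies"}

def fit_concept_transitions(sessions):
    """Tally concept-level transition counts and dwell-time samples
    from page-level click sessions (lists of page ids)."""
    transitions = Counter()
    dwell = defaultdict(list)
    for session in sessions:
        concepts = [page_to_concept[p] for p in session]
        run_len, prev = 1, concepts[0]
        for c in concepts[1:]:
            if c == prev:
                run_len += 1
            else:
                transitions[(prev, c)] += 1
                dwell[prev].append(run_len)
                run_len, prev = 1, c
        dwell[prev].append(run_len)
    return transitions, dwell

trans, dwell = fit_concept_transitions([["p1", "p2", "p3", "p4", "p1"]])
print(trans)  # Counter({('sports', 'movies'): 1, ('movies', 'sports'): 1})
print(dwell)  # dwell-time samples per concept, in consecutive page views
```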
Bregman Divergences and Triangle Inequality
While Bregman divergences have been used for clustering and embedding problems in recent years, the facts that they are asymmetric and do not satisfy the triangle inequality have been a major concern. In this paper, we investigate the relationship between two families of symmetrized Bregman divergences and metrics that satisfy the triangle inequality. The first family can be derived from any well-behaved convex function. The second family generalizes the Jensen-Shannon divergence and can only be derived from convex functions with a certain conditional positive definiteness structure. We interpret the required structure in terms of cumulants of infinitely divisible distributions and related results in harmonic analysis. We investigate k-means-type clustering problems using both families of symmetrized divergences, and give efficient algorithms for the same.
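As a small worked illustration (not the paper's construction), the Jensen-Shannon divergence arises from the negative-entropy generator under the Jensen-style symmetrization (phi(p) + phi(q))/2 - phi((p + q)/2), and its square root is known to satisfy the triangle inequality.

```python
import numpy as np

def neg_entropy(p):
    """Convex generator phi(p) = sum_i p_i log p_i (negative Shannon entropy)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(np.sum(p * np.log(p)))

def jensen_symmetrization(phi, p, q):
    """Jensen-style symmetrized divergence: (phi(p) + phi(q))/2 - phi((p + q)/2)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return 0.5 * (phi(p) + phi(q)) - phi(0.5 * (p + q))

p = [0.5, 0.5, 0.0]
q = [0.1, 0.4, 0.5]
js = jensen_symmetrization(neg_entropy, p, q)
print(js, np.sqrt(js))  # JS divergence (in nats) and its metric square root
```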