5 research outputs found

    Crosslingual Document Embedding as Reduced-Rank Ridge Regression

    Get PDF
    There has recently been much interest in extending vector-based word representations to multiple languages, such that words can be compared across languages. In this paper, we shift the focus from words to documents and introduce a method for embedding documents written in any language into a single, language-independent vector space. For training, our approach leverages a multilingual corpus where the same concept is covered in multiple languages (but not necessarily via exact translations), such as Wikipedia. Our method, Cr5 (Crosslingual reduced-rank ridge regression), starts by training a ridge-regression-based classifier that uses language-specific bag-of-word features in order to predict the concept that a given document is about. We show that, when constraining the learned weight matrix to be of low rank, it can be factored to obtain the desired mappings from language-specific bags-of-words to language-independent embeddings. As opposed to most prior methods, which use pretrained monolingual word vectors, postprocess them to make them crosslingual, and finally average word vectors to obtain document vectors, Cr5 is trained end-to-end and is thus natively crosslingual as well as document-level. Moreover, since our algorithm uses the singular value decomposition as its core operation, it is highly scalable. Experiments show that our method achieves state-of-the-art performance on a crosslingual document retrieval task. Finally, although not trained for embedding sentences and words, it also achieves competitive performance on crosslingual sentence and word retrieval tasks. Comment: In The Twelfth ACM International Conference on Web Search and Data Mining (WSDM '19).
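
    To make the low-rank step concrete, below is a minimal NumPy sketch of reduced-rank ridge regression for a single language, assuming a dense bag-of-words matrix X and one-hot concept labels Y (both made up here): the full ridge solution is truncated by projecting its fitted values onto their top singular directions, and the resulting factor A maps a bag-of-words vector to a low-dimensional embedding. The paper's actual Cr5 training couples the language-specific feature blocks and relies on scalable sparse SVD, which this toy omits.

        import numpy as np

        def reduced_rank_ridge(X, Y, lam, rank):
            """Sketch: rank-constrained ridge regression, factored into an embedding map."""
            d = X.shape[1]
            # Full-rank ridge solution B = (X^T X + lam*I)^{-1} X^T Y
            B = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)
            # Project onto the top-`rank` right singular vectors of the fitted values,
            # giving the rank-constrained solution B_k = B V_k V_k^T = A V_k^T.
            _, _, Vt = np.linalg.svd(X @ B, full_matrices=False)
            A = B @ Vt[:rank].T              # (n_vocab, rank): bag-of-words -> embedding
            return A

        rng = np.random.default_rng(0)
        X = rng.random((50, 200))                    # 50 documents, 200-word vocabulary
        Y = np.eye(10)[rng.integers(0, 10, 50)]      # 10 shared concepts, one-hot
        A = reduced_rank_ridge(X, Y, lam=1.0, rank=5)
        embeddings = X @ A                           # low-dimensional document vectors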

    Exploiting Social Network Structure for Person-to-Person Sentiment Analysis

    Full text link
    Person-to-person evaluations are prevalent in all kinds of discourse and important for establishing reputations, building social bonds, and shaping public opinion. Such evaluations can be analyzed separately using signed social networks and textual sentiment analysis, but this misses the rich interactions between language and social context. To capture such interactions, we develop a model that predicts individual A's opinion of individual B by synthesizing information from the signed social network in which A and B are embedded with sentiment analysis of the evaluative texts relating A to B. We prove that this problem is NP-hard but can be relaxed to an efficiently solvable hinge-loss Markov random field, and we show that this implementation outperforms text-only and network-only versions in two very different datasets involving community-level decision-making: the Wikipedia Requests for Adminship corpus and the Convote U.S. Congressional speech corpus.
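
    As a rough illustration of the relaxation (not the paper's actual model, which is a hinge-loss MRF instantiated over full signed networks and texts), the sketch below relaxes binary opinions to variables in [0, 1] and minimizes a convex sum of hinge potentials: one tying each opinion to a hypothetical text-sentiment score, plus one Lukasiewicz-style structural-balance rule over a single triad.

        import numpy as np
        from scipy.optimize import minimize

        # hypothetical text-sentiment scores in [0, 1] for directed pairs
        edges = {("a", "b"): 0.9, ("b", "c"): 0.8, ("a", "c"): 0.2}
        idx = {e: i for i, e in enumerate(edges)}
        hinge = lambda z: max(0.0, z)

        def objective(y, w_text=1.0, w_balance=2.0):
            # text evidence: the relaxed opinion y_AB should track the sentiment score s_AB
            obj = sum(w_text * (hinge(s - y[idx[e]]) + hinge(y[idx[e]] - s))
                      for e, s in edges.items())
            # balance rule pos(a,b) & pos(b,c) -> pos(a,c), as a Lukasiewicz hinge
            a_b, b_c, a_c = idx[("a", "b")], idx[("b", "c")], idx[("a", "c")]
            obj += w_balance * hinge(y[a_b] + y[b_c] - 1.0 - y[a_c])
            return obj

        y0 = np.full(len(edges), 0.5)
        res = minimize(objective, y0, bounds=[(0.0, 1.0)] * len(edges))
        print(dict(zip(edges, res.x.round(2))))      # relaxed opinion per pair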

    Multiclass extensions of Regularized Least Squares

    No full text
    Thesis (M. Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2010. Cataloged from PDF version of thesis. Includes bibliographical references (p. 81-83). We consider the problem of building a viable multiclass classification system that minimizes training data, is robust to noisy, imbalanced samples, and outputs confidence scores along with its predictions. These goals address critical steps along the entire classification pipeline that pertain to collecting data, training, and classifying. To this end, we investigate the merits of a classification framework that uses a robust algorithm known as Regularized Least Squares (RLS) as its basic classifier. We extend RLS to account for data imbalances, perform efficient active learning, and output confidence scores. Each of these extensions is a new result that combines with our other findings to give an altogether novel and effective classification system. Our first set of results investigates various ways to handle multiclass data imbalances and ultimately leads to a derivation of a weighted version of RLS with and without an offset term. Weighting RLS provides an effective countermeasure to imbalanced data and facilitates the automatic selection of a regularization parameter through exact and efficient calculation of the Leave One Out error. Next, we present two methods that estimate multiclass confidence from an asymptotic analysis of RLS and another method that stems from a Bayesian interpretation of the classifier. We show that while the third method incorporates more information in its estimate, the asymptotic methods are more accurate and resilient to imperfect kernel and regularization parameter choices. Finally, we present an active learning extension of RLS (ARLS) that uses our weighting methods to overcome imbalanced data. ARLS is particularly adept at this task because of its intelligent selection scheme. By Hristo Spassimirov Paskov. M.Eng.
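
    As a sketch of the weighted-RLS idea (the sample weights, targets, and parameter grid below are illustrative, and the thesis also covers the kernelized and offset variants), the closed-form leave-one-out residual e_i / (1 - h_ii) lets one score each candidate regularization parameter without retraining n times:

        import numpy as np

        def weighted_rls_loo(X, Y, c, lam):
            """Weighted RLS with closed-form leave-one-out residuals.
            X: (n, d) features, Y: (n, k) one-vs-all targets, c: (n,) sample weights."""
            n, d = X.shape
            A_inv = np.linalg.inv(X.T @ (c[:, None] * X) + lam * np.eye(d))
            W = A_inv @ X.T @ (c[:, None] * Y)             # (d, k) classifier weights
            h = c * np.einsum("ij,jk,ik->i", X, A_inv, X)  # h_ii = c_i x_i^T A^{-1} x_i
            loo = (Y - X @ W) / (1.0 - h)[:, None]         # exact LOO residuals
            return W, loo

        rng = np.random.default_rng(0)
        X = rng.standard_normal((100, 20))
        labels = (rng.random(100) < 0.2).astype(int)       # imbalanced binary labels
        Y = 2 * np.eye(2)[labels] - 1                      # {-1, +1} one-vs-all targets
        c = np.where(labels == 1, 4.0, 1.0)                # up-weight the rare class
        errs = {lam: np.mean(weighted_rls_loo(X, Y, c, lam)[1] ** 2)
                for lam in (0.01, 0.1, 1.0, 10.0)}
        print("best lam by LOO:", min(errs, key=errs.get))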

    On the feasibility of internet-scale author identification

    No full text
    We study techniques for identifying an anonymous author via linguistic stylometry, i.e., comparing the writing style against a corpus of texts of known authorship. We experimentally demonstrate the effectiveness of our techniques with as many as 100,000 candidate authors. Given the increasing availability of writing samples online, our result has serious implications for anonymity and free speech — an anonymous blogger or whistleblower may be unmasked unless they take steps to obfuscate their writing style. While there is a huge body of literature on authorship recognition based on writing style, almost none of it has studied corpora of more than a few hundred authors. The problem becomes qualitatively different at a large scale, as we show, and techniques from prior work fail to scale, both in terms of accuracy and performance. We study a variety of classifiers, both “lazy” and “eager,” and show how to handle the huge number of classes. We also develop novel techniques for confidence estimation of classifier outputs. Finally, we demonstrate stylometric authorship recognition on texts written in different contexts. In over 20% of cases, our classifiers can correctly identify an anonymous author given a corpus of texts from 100,000 authors; in about 35% of cases the correct author is one of the top 20 guesses. If we allow the classifier the option of not making a guess, via confidence estimation we are able to increase the precision of the top guess from 20% to over 80% with only a halving of recall.
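
    A toy version of the confidence-based abstention idea (the paper's classifiers and confidence estimators are more elaborate; the style features and threshold here are made up) compares a document's style vector against every candidate author's profile and refuses to guess when the gap between the top two similarities is small, trading recall for precision:

        import numpy as np

        def classify_or_abstain(profiles, doc, threshold=0.05):
            """'Lazy' nearest-profile stylometry with a gap-based abstain option.
            profiles: (n_authors, n_features) mean style vectors; doc: (n_features,)."""
            norms = np.linalg.norm(profiles, axis=1) * np.linalg.norm(doc)
            sims = profiles @ doc / np.maximum(norms, 1e-12)   # cosine similarities
            first, second = np.argsort(sims)[-2:][::-1]
            gap = sims[first] - sims[second]
            return (int(first), float(gap)) if gap >= threshold else None  # None = abstain

        rng = np.random.default_rng(1)
        profiles = rng.random((1000, 300))            # 1,000 candidate authors
        doc = profiles[42] + 0.01 * rng.random(300)   # a text close to author 42's style
        print(classify_or_abstain(profiles, doc))     # (42, gap) or None if unsure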