
    Producing power-law distributions and damping word frequencies with two-stage language models

    Standard statistical models of language fail to capture one of the most striking properties of natural languages: the power-law distribution in the frequencies of word tokens. We present a framework for developing statistical models that can generically produce power laws, breaking generative models into two stages. The first stage, the generator, can be any standard probabilistic model, while the second stage, the adaptor, transforms the word frequencies of this model to provide a closer match to natural language. We show that two commonly used Bayesian models, the Dirichlet-multinomial model and the Dirichlet process, can be viewed as special cases of our framework. We discuss two stochastic processes, the Chinese restaurant process and its two-parameter generalization based on the Pitman-Yor process, that can be used as adaptors in our framework to produce power-law distributions over word frequencies. We show that these adaptors justify common estimation procedures based on logarithmic or inverse-power transformations of empirical frequencies. In addition, taking the Pitman-Yor Chinese restaurant process as an adaptor justifies the appearance of type frequencies in formal analyses of natural language and improves the performance of a model for unsupervised learning of morphology. 48 pages.
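The adaptor described in the abstract can be simulated directly. Below is a minimal sketch of the two-parameter (Pitman-Yor) Chinese restaurant process, showing how table (word-type) counts develop the heavy tail the paper relies on; the discount and concentration values are illustrative, not taken from the paper.

```python
import random

def pitman_yor_crp(n_customers, discount=0.5, concentration=1.0, seed=0):
    """Simulate the two-parameter (Pitman-Yor) Chinese restaurant process.

    Returns the list of table sizes after seating n_customers; with
    discount > 0 the size distribution develops a power-law tail.
    """
    rng = random.Random(seed)
    tables = []  # tables[k] = number of customers seated at table k
    for n in range(n_customers):
        # P(new table) = (concentration + discount * K) / (concentration + n)
        p_new = (concentration + discount * len(tables)) / (concentration + n)
        if rng.random() < p_new:
            tables.append(1)
        else:
            # P(existing table k) is proportional to tables[k] - discount
            r = rng.random() * (n - discount * len(tables))
            acc = 0.0
            for k, size in enumerate(tables):
                acc += size - discount
                if r < acc:
                    tables[k] += 1
                    break
            else:
                tables[-1] += 1  # guard against floating-point round-off
    return tables

sizes = pitman_yor_crp(10_000)
```

Sorting `sizes` in decreasing order and plotting size against rank on log-log axes makes the power-law behavior visible.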

    Multilingual Part-of-Speech Tagging: Two Unsupervised Approaches

    We demonstrate the effectiveness of multilingual learning for unsupervised part-of-speech tagging. The central assumption of our work is that by combining cues from multiple languages, the structure of each becomes more apparent. We consider two ways of applying this intuition to the problem of unsupervised part-of-speech tagging: a model that directly merges tag structures for a pair of languages into a single sequence and a second model which instead incorporates multilingual context using latent variables. Both approaches are formulated as hierarchical Bayesian models, using Markov Chain Monte Carlo sampling techniques for inference. Our results demonstrate that by incorporating multilingual evidence we can achieve impressive performance gains across a range of scenarios. We also found that performance improves steadily as the number of available languages increases.
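A toy illustration of the central assumption (not the paper's hierarchical Bayesian model): if the two words of an aligned pair are assumed to share a tag, multiplying the per-language tag evidence sharpens an ambiguous distribution. The word pair and probabilities below are invented for illustration.

```python
def normalize(d):
    """Rescale a dict of nonnegative weights so its values sum to 1."""
    z = sum(d.values())
    return {k: v / z for k, v in d.items()}

# Hypothetical per-language tag likelihoods for one aligned word pair.
p_en = {"NOUN": 0.5, "VERB": 0.5}  # the English word alone is ambiguous
p_fr = {"NOUN": 0.9, "VERB": 0.1}  # its French counterpart is mostly nominal

# Under the shared-tag assumption, the combined evidence is the
# normalized product of the two per-language distributions.
combined = normalize({t: p_en[t] * p_fr[t] for t in p_en})
```

Here the combined distribution puts 0.9 on NOUN, resolving the ambiguity that neither language resolves as sharply on its own.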

    Experimental Evaluation of Representation Models for Content Recommendation in Microblogging Services

    Micro-blogging services constitute a popular means of real-time communication and information sharing. Twitter is the most popular of these services, with 300 million monthly active user accounts and 500 million tweets posted on a daily basis at the moment. Consequently, Twitter users suffer from an information deluge, and a large number of recommendation methods have been proposed to re-rank the tweets in a user's timeline according to her interests. We focus on techniques that build a textual model for every individual user to capture her tastes and then rank the tweets she receives according to their similarity with that model. In the literature, there is no comprehensive evaluation of these user modeling strategies as yet. To cover this gap, in this thesis we systematically examine, on a real Twitter dataset, 9 state-of-the-art methods for modeling a user's preferences using exclusively textual information. Our goal is to identify the best performing user model with respect to several criteria: (i) the source of tweet information available for modeling, (ii) the user type, as determined by the relation between the tweeting frequency of a user and the frequency of her received tweets, (iii) the characteristics of its functionality, as derived from a novel taxonomy, and (iv) its robustness with respect to its internal configurations, as deduced by assessing a wide range of plausible values for internal parameters. Our results can be used for fine-tuning and interpreting text-based user models in a recommendation scenario in microblogging services and could serve as a starting point for further enhancing the most effective user model with additional contextual information.
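A minimal sketch of the general recipe this thesis evaluates: build a textual profile from a user's own tweets, then rank incoming tweets by similarity to it. The TF-IDF weighting, whitespace tokenization, and example tweets below are illustrative choices, not any specific method from the thesis.

```python
import math
from collections import Counter

def tfidf(docs):
    """TF-IDF vectors over whitespace tokens, with a smoothed IDF."""
    df = Counter(t for d in docs for t in set(d.split()))
    n = len(docs)
    idf = {t: math.log(1.0 + n / df[t]) for t in df}
    return [{t: c * idf[t] for t, c in Counter(d.split()).items()} for d in docs]

def cosine(u, v):
    """Cosine similarity between two sparse (dict) vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Hypothetical data: the user's own tweets define her profile.
user_tweets = ["deep learning tutorial", "neural network training tips"]
incoming = ["new deep learning paper", "best pizza in town"]

vecs = tfidf(user_tweets + incoming)
profile = Counter()
for v in vecs[: len(user_tweets)]:
    profile.update(v)  # profile = sum of the user's tweet vectors

# Rank incoming tweets by similarity to the profile, most similar first.
scored = sorted(
    zip(incoming, vecs[len(user_tweets):]),
    key=lambda pair: cosine(profile, pair[1]),
    reverse=True,
)
```

The on-topic tweet ranks above the off-topic one because it shares weighted terms with the profile, while the other has zero overlap.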

    Modelling Digital Media Objects


    Minimax estimation of smooth optimal transport maps

    Brenier's theorem is a cornerstone of optimal transport that guarantees the existence of an optimal transport map T between two probability distributions P and Q over R^d under certain regularity conditions. The main goal of this work is to establish the minimax estimation rates for such a transport map from data sampled from P and Q under additional smoothness assumptions on T. To achieve this goal, we develop an estimator based on the minimization of an empirical version of the semi-dual optimal transport problem, restricted to truncated wavelet expansions. This estimator is shown to achieve near minimax optimality using new stability arguments for the semi-dual and a complementary minimax lower bound. Furthermore, we provide numerical experiments on synthetic data supporting our theoretical findings and highlighting the practical benefits of smoothness regularization. These are the first minimax estimation rates for transport maps in general dimension. Comment: 53 pages, 6 figures.
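In one dimension, the optimal transport map reduces to the monotone rearrangement T = F_Q^{-1} o F_P, which yields a simple plug-in baseline from equally sized samples. This sketch is only that baseline, not the paper's wavelet-based semi-dual estimator.

```python
def empirical_transport_map_1d(xs, ys):
    """Plug-in estimate of the 1-D optimal transport map T = F_Q^{-1} o F_P:
    send the i-th order statistic of the source sample to the i-th order
    statistic of the target sample (monotone rearrangement).

    Assumes len(xs) == len(ys) so ranks correspond one-to-one.
    """
    xs_sorted, ys_sorted = sorted(xs), sorted(ys)

    def T(x):
        # empirical rank of x under P, clamped to the sample range
        i = sum(1 for v in xs_sorted if v <= x) - 1
        i = max(0, min(i, len(ys_sorted) - 1))
        return ys_sorted[i]

    return T
```

On a pure location shift the estimated map reproduces the shift exactly at the sample points, which makes it a convenient sanity check.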

    Unsupervised multilingual learning

    Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2010. Cataloged from PDF version of thesis. Includes bibliographical references (p. 241-254). For centuries, scholars have explored the deep links among human languages. In this thesis, we present a class of probabilistic models that exploit these links as a form of naturally occurring supervision. These models allow us to substantially improve performance for core text processing tasks, such as morphological segmentation, part-of-speech tagging, and syntactic parsing. Besides these traditional NLP tasks, we also present a multilingual model for lost-language decipherment. We test this model on the ancient Ugaritic language. Our results show that we can automatically uncover much of the historical relationship between Ugaritic and Biblical Hebrew, a known related language. by Benjamin Snyder. Ph.D.

    Posterior Regularization for Learning with Side Information and Weak Supervision

    Supervised machine learning techniques have been very successful for a variety of tasks and domains, including natural language processing, computer vision, and computational biology. Unfortunately, their use often requires the creation of large problem-specific training corpora that can make these methods prohibitively expensive. At the same time, we often have access to external problem-specific information that we cannot always easily incorporate. We might know how to solve the problem in another domain (e.g., for a different language); we might have access to cheap but noisy training data; or a domain expert might be available who could guide a human learner much more efficiently than by simply creating an IID training corpus. A key challenge for weakly supervised learning is then how to incorporate such kinds of auxiliary information arising from indirect supervision. In this thesis, we present Posterior Regularization, a probabilistic framework for structured, weakly supervised learning. Posterior Regularization is applicable to probabilistic models with latent variables and exports a language for specifying constraints or preferences about posterior distributions of latent variables. We show that this language is powerful enough to specify realistic prior knowledge for a variety of applications in natural language processing. Additionally, because Posterior Regularization separates model complexity from the complexity of structural constraints, it can be used for structured problems with relatively little computational overhead. We apply Posterior Regularization to several problems in natural language processing, including word alignment for machine translation, transfer of linguistic resources across languages, and grammar induction. Additionally, we find that we can apply Posterior Regularization to the problem of multi-view learning, achieving particularly good results for transfer learning. We also explore the theoretical relationship between Posterior Regularization and other proposed frameworks for encoding this kind of prior knowledge, and show a close relationship to Constraint Driven Learning as well as to Generalized Expectation Constraints.
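The KL projection at the heart of this kind of constrained posterior inference can be sketched for a single moment constraint E_q[f] <= b on a discrete posterior. The exponential-family form of the minimizer is standard; the bisection below is just one simple way to solve for the multiplier, and the distributions are illustrative.

```python
import math

def kl_project(p, f, b, iters=60):
    """KL-project a discrete distribution p onto {q : E_q[f] <= b}.

    The minimizer has the form q(z) proportional to p(z) * exp(-lam * f(z))
    with lam >= 0; lam is found by bisection on the moment constraint.
    """
    assert b > min(f), "constraint must be strictly feasible"

    def q_for(lam):
        w = [pi * math.exp(-lam * fi) for pi, fi in zip(p, f)]
        z = sum(w)
        return [wi / z for wi in w]

    def expect(q):
        return sum(qi * fi for qi, fi in zip(q, f))

    if expect(p) <= b:  # constraint already holds: the projection is p itself
        return p
    lo, hi = 0.0, 1.0
    while expect(q_for(hi)) > b:  # grow the upper bracket until feasible
        hi *= 2.0
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if expect(q_for(mid)) > b:
            lo = mid
        else:
            hi = mid
    return q_for(hi)
```

For example, projecting p = [0.7, 0.3] with f = [1, 0] onto the set where E_q[f] <= 0.5 reweights the first outcome down until its probability hits 0.5, while a p that already satisfies the constraint is returned unchanged.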