405 research outputs found
Producing power-law distributions and damping word frequencies with two-stage language models
Standard statistical models of language fail to capture one of the most striking properties of natural languages: the power-law distribution in the frequencies of word tokens. We present a framework for developing statistical models that can generically produce power laws, breaking generative models into two stages. The first stage, the generator, can be any standard probabilistic model, while the second stage, the adaptor, transforms the word frequencies of this model to provide a closer match to natural language. We show that two commonly used Bayesian models, the Dirichlet-multinomial model and the Dirichlet process, can be viewed as special cases of our framework. We discuss two stochastic processes, the Chinese restaurant process and its two-parameter generalization based on the Pitman-Yor process, that can be used as adaptors in our framework to produce power-law distributions over word frequencies. We show that these adaptors justify common estimation procedures based on logarithmic or inverse-power transformations of empirical frequencies. In addition, taking the Pitman-Yor Chinese restaurant process as an adaptor justifies the appearance of type frequencies in formal analyses of natural language and improves the performance of a model for unsupervised learning of morphology.
48 page(s)
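The adaptor the abstract describes can be illustrated with a minimal sketch of the two-parameter (Pitman-Yor) Chinese restaurant process; the function name and default parameter values below are illustrative choices, not the paper's, but the seating probabilities follow the standard two-parameter CRP definition, under which table sizes (word-type frequencies) follow a power law:

```python
import random

def pitman_yor_crp(n, d=0.8, theta=1.0, rng=None):
    """Sample a seating arrangement from the two-parameter CRP.

    d     -- discount, 0 <= d < 1 (controls the power-law exponent)
    theta -- concentration parameter, theta > -d
    Returns a list of table sizes (customers per table).
    """
    rng = rng or random.Random(0)
    tables = []  # customer counts per table
    for i in range(n):
        total = i + theta
        # new table with probability (theta + d * #tables) / (i + theta)
        if rng.random() < (theta + d * len(tables)) / total:
            tables.append(1)
        else:
            # join existing table k with probability (count_k - d) / (i + theta);
            # conditional on joining, mass is proportional to count_k - d
            r = rng.random() * (i - d * len(tables))
            acc = 0.0
            for k, c in enumerate(tables):
                acc += c - d
                if r < acc:
                    tables[k] += 1
                    break
    return tables
```

Running this with large n produces a heavy-tailed distribution of table sizes: a few very large tables (frequent word types) and many singleton tables, which is the qualitative behavior the adaptor contributes.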
Multilingual Part-of-Speech Tagging: Two Unsupervised Approaches
We demonstrate the effectiveness of multilingual learning for unsupervised
part-of-speech tagging. The central assumption of our work is that by combining
cues from multiple languages, the structure of each becomes more apparent. We
consider two ways of applying this intuition to the problem of unsupervised
part-of-speech tagging: a model that directly merges tag structures for a pair
of languages into a single sequence and a second model which instead
incorporates multilingual context using latent variables. Both approaches are
formulated as hierarchical Bayesian models, using Markov Chain Monte Carlo
sampling techniques for inference. Our results demonstrate that by
incorporating multilingual evidence we can achieve impressive performance gains
across a range of scenarios. We also found that performance improves steadily
as the number of available languages increases.
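The abstract's models are multilingual hierarchical Bayesian models; as a simplified monolingual illustration of the Markov chain Monte Carlo machinery it mentions, here is a toy collapsed Gibbs sampler for a Bayesian HMM tag inducer (all names and hyperparameters are illustrative, and the small count corrections for coinciding adjacent tags are ignored for brevity):

```python
import random
from collections import defaultdict

def gibbs_pos(sentences, num_tags=5, alpha=0.1, beta=0.1, iters=50, seed=0):
    """Toy collapsed Gibbs sampler for unsupervised HMM tag induction.

    sentences -- list of lists of word strings
    Returns one integer tag per token, same shape as sentences.
    """
    rng = random.Random(seed)
    V = len({w for s in sentences for w in s})
    tags = [[rng.randrange(num_tags) for _ in s] for s in sentences]
    trans = defaultdict(int)     # (prev_tag, tag) counts; prev_tag -1 = start
    emit = defaultdict(int)      # (tag, word) counts
    tag_tot = defaultdict(int)   # emission totals per tag
    prev_tot = defaultdict(int)  # transition totals per previous tag
    for s, ts in zip(sentences, tags):
        prev = -1
        for w, t in zip(s, ts):
            trans[(prev, t)] += 1; prev_tot[prev] += 1
            emit[(t, w)] += 1; tag_tot[t] += 1
            prev = t
    for _ in range(iters):
        for s, ts in zip(sentences, tags):
            for i, w in enumerate(s):
                t = ts[i]
                prev = ts[i - 1] if i > 0 else -1
                nxt = ts[i + 1] if i + 1 < len(s) else None
                # remove token i's contributions from the counts
                trans[(prev, t)] -= 1; prev_tot[prev] -= 1
                emit[(t, w)] -= 1; tag_tot[t] -= 1
                if nxt is not None:
                    trans[(t, nxt)] -= 1; prev_tot[t] -= 1
                # conditional probability of each candidate tag
                probs = []
                for c in range(num_tags):
                    p = (trans[(prev, c)] + alpha) / (prev_tot[prev] + num_tags * alpha)
                    p *= (emit[(c, w)] + beta) / (tag_tot[c] + V * beta)
                    if nxt is not None:
                        p *= (trans[(c, nxt)] + alpha) / (prev_tot[c] + num_tags * alpha)
                    probs.append(p)
                r = rng.random() * sum(probs)
                acc = 0.0
                for c, p in enumerate(probs):
                    acc += p
                    if r < acc:
                        t = c
                        break
                ts[i] = t
                trans[(prev, t)] += 1; prev_tot[prev] += 1
                emit[(t, w)] += 1; tag_tot[t] += 1
                if nxt is not None:
                    trans[(t, nxt)] += 1; prev_tot[t] += 1
    return tags
```

The multilingual models in the paper extend this idea by letting aligned words in different languages share latent structure, so that each language's ambiguity is resolved by the others' cues.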
Experimental Evaluation of Representation Models for Content Recommendation in Microblogging Services
Micro-blogging services constitute a popular means of real-time communication
and information sharing. Twitter is the most popular of these services,
currently with 300 million monthly active user accounts and 500 million tweets
posted daily. Consequently, Twitter users suffer from an information
deluge and a large number of recommendation methods have been proposed to
re-rank the tweets in a user's timeline according to her interests. We focus on
techniques that build a textual model for every individual user to capture her
tastes and then rank the tweets she receives according to their similarity with
that model.
In the literature, there is no comprehensive evaluation of these user modeling
strategies as yet. To cover this gap, in this thesis we systematically examine,
on a real Twitter dataset, nine state-of-the-art methods for modeling a user's
preferences using exclusively textual information. Our goal is to identify the
best-performing user model with respect to several criteria: (i) the source of
tweet information available for modeling, (ii) the user type, as determined by
the relation between the tweeting frequency of a user and the frequency of her
received tweets, (iii) the characteristics of its functionality, as derived
from a novel taxonomy, and (iv) its robustness with respect to its internal
configurations, as deduced by assessing a wide range of plausible values for
internal parameters. Our results can be used for fine-tuning and interpreting
text user models in a recommendation scenario in microblogging services and
could serve as a starting point for further enhancing the most effective user
model with additional contextual information.
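The core pipeline the thesis evaluates (build a textual user model, then rank incoming tweets by similarity to it) can be sketched with plain TF-IDF and cosine similarity; this is one simple instance of the family of methods compared, not any specific method from the thesis, and all function names are illustrative:

```python
import math
from collections import Counter

def build_profile(user_tweets):
    """Aggregate a user's tweets into one TF-IDF-weighted bag of words."""
    docs = [t.lower().split() for t in user_tweets]
    n = len(docs)
    df = Counter(w for d in docs for w in set(d))  # document frequency
    tf = Counter(w for d in docs for w in d)       # term frequency
    # smoothed idf so that terms in every tweet still keep positive weight
    return {w: tf[w] * (1.0 + math.log((1 + n) / (1 + df[w]))) for w in tf}

def cosine(u, v):
    """Cosine similarity between two sparse word-weight mappings."""
    num = sum(u[w] * v[w] for w in set(u) & set(v))
    den = math.sqrt(sum(x * x for x in u.values())) * \
          math.sqrt(sum(x * x for x in v.values()))
    return num / den if den else 0.0

def rank_tweets(profile, incoming):
    """Order incoming tweets by decreasing similarity to the user profile."""
    scored = [(cosine(profile, Counter(t.lower().split())), t) for t in incoming]
    return [t for _, t in sorted(scored, key=lambda x: -x[0])]
```

For example, a profile built from tweets about Python will rank an incoming tweet that shares vocabulary with it above an unrelated one; the thesis's criteria (information source, user type, robustness to parameters) then vary how such profiles are built and configured.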
Minimax estimation of smooth optimal transport maps
Brenier's theorem is a cornerstone of optimal transport that guarantees the
existence of an optimal transport map between two probability distributions
under certain regularity conditions. The main goal of this work is to
establish the minimax estimation rates for such a transport map from data
sampled from these distributions, under additional smoothness assumptions. To
achieve this goal, we develop an estimator based on the
minimization of an empirical version of the semi-dual optimal transport
problem, restricted to truncated wavelet expansions. This estimator is shown to
achieve near minimax optimality using new stability arguments for the semi-dual
and a complementary minimax lower bound. Furthermore, we provide numerical
experiments on synthetic data supporting our theoretical findings and
highlighting the practical benefits of smoothness regularization. These are the
first minimax estimation rates for transport maps in general dimension.
Comment: 53 pages, 6 figures
Unsupervised multilingual learning
Thesis (Ph.D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2010. Cataloged from PDF version of thesis. Includes bibliographical references (p. 241-254). For centuries, scholars have explored the deep links among human languages. In this thesis, we present a class of probabilistic models that exploit these links as a form of naturally occurring supervision. These models allow us to substantially improve performance for core text processing tasks, such as morphological segmentation, part-of-speech tagging, and syntactic parsing. Besides these traditional NLP tasks, we also present a multilingual model for lost language decipherment. We test this model on the ancient Ugaritic language. Our results show that we can automatically uncover much of the historical relationship between Ugaritic and Biblical Hebrew, a known related language. by Benjamin Snyder. Ph.D.
Posterior Regularization for Learning with Side Information and Weak Supervision
Supervised machine learning techniques have been very successful for a variety of tasks and domains including natural language processing, computer vision, and computational biology. Unfortunately, their use often requires creation of large problem-specific training corpora that can make these methods prohibitively expensive. At the same time, we often have access to external problem-specific information that we cannot always easily incorporate. We might know how to solve the problem in another domain (e.g. for a different language); we might have access to cheap but noisy training data; or a domain expert might be available who would be able to guide a human learner much more efficiently than by simply creating an IID training corpus. A key challenge for weakly supervised learning is then how to incorporate such kinds of auxiliary information arising from indirect supervision.
In this thesis, we present Posterior Regularization, a probabilistic framework for structured, weakly supervised learning. Posterior Regularization is applicable to probabilistic models with latent variables and exports a language for specifying constraints or preferences about posterior distributions of latent variables. We show that this language is powerful enough to specify realistic prior knowledge for a variety of applications in natural language processing. Additionally, because Posterior Regularization separates model complexity from the complexity of structural constraints, it can be used for structured problems with relatively little computational overhead. We apply Posterior Regularization to several problems in natural language processing including word alignment for machine translation, transfer of linguistic resources across languages, and grammar induction. Additionally, we find that we can apply Posterior Regularization to the problem of multi-view learning, achieving particularly good results for transfer learning. We also explore the theoretical relationship between Posterior Regularization and other proposed frameworks for encoding this kind of prior knowledge, and show a close relationship to Constraint Driven Learning as well as to Generalized Expectation Constraints.
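The central operation in such constrained-posterior frameworks is projecting a model posterior onto a constraint set by minimizing KL divergence; for a single moment constraint E_q[f] <= b on a discrete posterior, the projection has the standard exponentiated-dual form q_i proportional to p_i * exp(-lambda * f_i), and the dual variable can be found by bisection. The toy sketch below illustrates that projection (function names are illustrative, and it assumes the constraint is feasible, i.e. b is at least the minimum value of f):

```python
import math

def pr_project(p, f, b):
    """Project discrete posterior p onto {q : E_q[f] <= b}, minimizing KL(q || p).

    p -- list of probabilities summing to 1
    f -- feature value f_i for each outcome
    b -- upper bound on the expected feature value
    """
    def q_of(lam):
        # dual-form solution: q_i proportional to p_i * exp(-lam * f_i)
        w = [pi * math.exp(-lam * fi) for pi, fi in zip(p, f)]
        z = sum(w)
        return [x / z for x in w]

    def expect(q):
        return sum(qi * fi for qi, fi in zip(q, f))

    if expect(p) <= b:            # constraint already satisfied: q = p
        return list(p)
    lo, hi = 0.0, 1.0
    while expect(q_of(hi)) > b:   # grow the upper bracket for the dual variable
        hi *= 2
        if hi > 1e8:
            break
    for _ in range(200):          # bisection: E_q[f] is decreasing in lam
        mid = (lo + hi) / 2
        if expect(q_of(mid)) > b:
            lo = mid
        else:
            hi = mid
    return q_of(hi)
```

For instance, projecting a uniform posterior over two outcomes with f = (0, 1) onto the constraint E_q[f] <= 0.25 downweights the second outcome to exactly 0.25; the framework in the thesis applies the same idea to structured latent variables inside an EM-style loop.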