4 research outputs found

    Empirical Gaussian priors for cross-lingual transfer learning

    Sequence model learning algorithms typically maximize log-likelihood minus the norm of the model (or minimize Hamming loss plus the norm). In cross-lingual part-of-speech (POS) tagging, our target language training data consists of sequences of sentences with word-by-word labels projected, via word alignments, from translations in k languages for which we have labeled data. Our training data is therefore very noisy, and if Rademacher complexity is high, learning algorithms are prone to overfit. Norm-based regularization assumes a zero-mean, constant-width prior. We instead propose to use the k source language models to estimate the parameters of a Gaussian prior for learning new POS taggers. This leads to significantly better performance in multi-source transfer set-ups. We also present a drop-out version that injects (empirical) Gaussian noise during online learning. Finally, we note that using empirical Gaussian priors leads to much lower Rademacher complexity and is superior to optimally weighted model interpolation.

    Comment: Presented at NIPS 2015 Workshop on Transfer and Multi-Task Learning
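    The idea described above can be sketched as follows: estimate a per-parameter mean and variance from the k source models, then penalize deviation from that mean instead of deviation from zero. This is a minimal illustration, not the authors' implementation; the function names and the variance smoothing constant are assumptions.

```python
import numpy as np

def empirical_gaussian_prior(source_weights):
    """Estimate a per-parameter Gaussian prior from k source-language models.

    source_weights: array of shape (k, d), one weight vector per source model.
    Returns the empirical mean and (smoothed) variance for each parameter.
    """
    mu = source_weights.mean(axis=0)
    sigma2 = source_weights.var(axis=0) + 1e-8  # smooth to avoid zero variance
    return mu, sigma2

def prior_penalty(w, mu, sigma2):
    """Regularization term: negative log-density of w under the empirical
    Gaussian prior, up to an additive constant: sum_j (w_j - mu_j)^2 / (2 sigma2_j).
    With mu = 0 and constant sigma2 this reduces to standard L2 regularization."""
    return np.sum((w - mu) ** 2 / (2.0 * sigma2))

def prior_gradient(w, mu, sigma2):
    """Gradient of the penalty, added to the loss gradient in each online update."""
    return (w - mu) / sigma2
```

    During online learning, each update would subtract a step in the direction of `prior_gradient`, pulling parameters toward the source-model mean, with low-variance (confidently shared) parameters pulled hardest.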

    User Review Sites as a Resource for Large-Scale Sociolinguistic Studies

    Sociolinguistic studies investigate the relation between language and extra-linguistic variables. This requires both representative text data and the associated socio-economic meta-data of the subjects. Traditionally, sociolinguistic studies use small samples of hand-curated data and meta-data. This can lead to exaggerated or false conclusions. Using social media data offers a large-scale source of language data, but usually lacks reliable socio-economic meta-data. Our research aims to remedy both problems by exploring a large new data source: international review websites with user profiles. They provide more text data than manually collected studies, and more meta-data than most available social media text. We describe the data and present various pilot studies, illustrating the usefulness of this resource for sociolinguistic studies. Our approach can help generate new research hypotheses based on data-driven findings across several countries and languages.

    24th Nordic Conference on Computational Linguistics (NoDaLiDa)


    Zipfian corruptions for robust POS tagging

    Inspired by robust generalization and adversarial learning, we describe a novel approach to learning structured perceptrons for part-of-speech (POS) tagging that is less sensitive to domain shifts. The objective of our method is to minimize average loss under random distribution shifts. We restrict the possible target distributions to mixtures of the source distribution and random Zipfian distributions. Our algorithm is used for POS tagging and evaluated on the English Web Treebank and the Danish Dependency Treebank, with an average 4.4% error reduction in tagging accuracy.
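    The restricted shift family described above can be sketched as follows: draw a Zipfian distribution with a random rank assignment over the vocabulary, then mix it with the empirical source distribution. This is a hypothetical illustration of the corruption step only, not the authors' training algorithm; the mixing parameter `alpha` and function names are assumptions.

```python
import random

def zipfian_weights(n, s=1.0):
    """A random Zipfian distribution over n items: probability proportional to
    1/rank^s, with ranks assigned to items in a random order."""
    ranks = list(range(1, n + 1))
    random.shuffle(ranks)  # randomly assign each item a rank
    w = [1.0 / r ** s for r in ranks]
    z = sum(w)
    return [x / z for x in w]

def corrupted_distribution(source_probs, alpha, s=1.0):
    """A random target distribution: a mixture of the source distribution
    (weight alpha) and a random Zipfian distribution (weight 1 - alpha)."""
    zipf = zipfian_weights(len(source_probs), s)
    return [alpha * p + (1 - alpha) * q for p, q in zip(source_probs, zipf)]
```

    Averaging the training loss over many such sampled mixtures would approximate the stated objective of minimizing average loss under random Zipfian distribution shifts.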