Multilingual Coreference Resolution in Multiparty Dialogue
Existing multiparty dialogue datasets for coreference resolution are nascent,
and many challenges are still unaddressed. We create a large-scale dataset,
Multilingual Multiparty Coref (MMC), for this task based on TV transcripts. Due
to the availability of gold-quality subtitles in multiple languages, we propose
reusing the annotations to create silver coreference data in other languages
(Chinese and Farsi) via annotation projection. On the gold (English) data,
off-the-shelf models perform relatively poorly on MMC, suggesting that MMC has
broader coverage of multiparty coreference than prior datasets. On the silver
data, we find it useful both for data augmentation and for training from
scratch, the latter effectively simulating the zero-shot cross-lingual setting.
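The annotation-projection step described above can be sketched as follows. This is a minimal illustration, not the MMC pipeline: the function name, the span-dropping heuristic, and the toy token alignment are all hypothetical.

```python
# Sketch of coreference annotation projection, assuming gold English mention
# spans and a token-level English->target alignment are already available.
# All names and the toy data below are hypothetical illustrations.

def project_mentions(gold_mentions, alignment):
    """Map gold (start, end) token spans onto target-language tokens.

    gold_mentions: list of (start, end) inclusive spans over source tokens.
    alignment: dict mapping source token index -> target token index.
    Spans with any unaligned token are dropped (a common heuristic).
    """
    projected = []
    for start, end in gold_mentions:
        targets = [alignment.get(i) for i in range(start, end + 1)]
        if any(t is None for t in targets):
            continue  # unaligned token inside the span: skip the mention
        projected.append((min(targets), max(targets)))
    return projected

# English: "Ross said he agreed"; the target language reorders the tokens.
alignment = {0: 1, 1: 0, 2: 2, 3: 3}
gold_mentions = [(0, 0), (2, 2)]                   # "Ross", "he"
print(project_mentions(gold_mentions, alignment))  # [(1, 1), (2, 2)]
```

The span-level check is deliberately conservative: dropping partially aligned mentions trades recall for precision in the silver data.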
Bootstrapping Multilingual AMR with Contextual Word Alignments
We develop high-performance multilingual Abstract Meaning Representation (AMR)
systems by projecting English AMR annotations to other languages with weak
supervision. We achieve this goal by bootstrapping transformer-based
multilingual word embeddings, in particular those from cross-lingual RoBERTa
(XLM-R large). We develop a novel technique for foreign-text-to-English AMR
alignment, using the contextual word alignment between English and foreign
language tokens. This word alignment is weakly supervised and relies on the
contextualized XLM-R word embeddings. We achieve a highly competitive
performance that surpasses the best published results for German, Italian,
Spanish and Chinese.
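The core alignment idea, matching tokens by similarity of their contextual embeddings, can be sketched with a greedy argmax over cosine similarities. The 2-d vectors below are made-up stand-ins; a real system would use XLM-R hidden states, and the function name is hypothetical.

```python
import numpy as np

# Toy sketch of contextual word alignment: align each foreign token to the
# English token whose embedding has the highest cosine similarity.

def align(src_emb, tgt_emb):
    """Greedy argmax alignment, returned as {target index: source index}."""
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    sim = tgt @ src.T  # cosine similarity matrix, shape (|tgt|, |src|)
    return {j: int(sim[j].argmax()) for j in range(len(tgt))}

src = np.array([[1.0, 0.0], [0.0, 1.0]])  # English token embeddings
tgt = np.array([[0.1, 0.9], [0.9, 0.2]])  # foreign tokens, reordered
print(align(src, tgt))  # {0: 1, 1: 0}
```

Greedy argmax is the simplest decision rule; bidirectional agreement or optimal-transport matching are common refinements when many-to-one links cause trouble.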
PersoNER: Persian named-entity recognition
Named-Entity Recognition (NER) is still a challenging task for languages with low digital resources. The main difficulties arise from the scarcity of annotated corpora and the consequent problematic training of an effective NER pipeline. To bridge this gap, in this paper we target the Persian language, which is spoken by a population of over a hundred million people worldwide. We first present and provide ArmanPersoNERCorpus, the first manually annotated Persian NER corpus. Then, we introduce PersoNER, an NER pipeline for Persian that leverages a word embedding and a sequential max-margin classifier. The experimental results show that the proposed approach is capable of achieving competitive MUC-7 and CoNLL scores while outperforming two alternatives based on a CRF and a recurrent neural network.
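To make the "sequential classifier" idea concrete, here is a toy greedy perceptron-style tagger. This is not PersoNER's max-margin model, only a structurally similar stand-in: the features, update rule, and data are hypothetical.

```python
# Minimal sketch of a sequential NER tagger: each token is classified using
# the word itself and the previous predicted tag. A perceptron update stands
# in for max-margin training; everything below is a toy illustration.

def featurize(words, i, prev_tag):
    return {("w", words[i]), ("prev", prev_tag)}

def score(weights, feats, tag):
    return sum(weights.get((f, tag), 0.0) for f in feats)

def train(sentences, tag_seqs, labels, epochs=5):
    weights = {}
    for _ in range(epochs):
        for words, gold in zip(sentences, tag_seqs):
            prev = "<s>"
            for i, g in enumerate(gold):
                feats = featurize(words, i, prev)
                pred = max(labels, key=lambda t: score(weights, feats, t))
                if pred != g:  # push weights toward the gold tag
                    for f in feats:
                        weights[(f, g)] = weights.get((f, g), 0.0) + 1.0
                        weights[(f, pred)] = weights.get((f, pred), 0.0) - 1.0
                prev = g
    return weights

def tag(weights, words, labels):
    prev, out = "<s>", []
    for i in range(len(words)):
        feats = featurize(words, i, prev)
        prev = max(labels, key=lambda t: score(weights, feats, t))
        out.append(prev)
    return out

labels = ["O", "PER"]
sents = [["Arman", "spoke"], ["he", "spoke"]]
gold = [["PER", "O"], ["O", "O"]]
w = train(sents, gold, labels)
print(tag(w, ["Arman", "spoke"], labels))  # ['PER', 'O']
```

Greedy left-to-right decoding keeps the sketch short; a real pipeline would decode with Viterbi and use far richer features (affixes, casing, embeddings).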
One Model to Rule them all: Multitask and Multilingual Modelling for Lexical Analysis
When learning a new skill, you take advantage of your preexisting skills and
knowledge. For instance, if you are a skilled violinist, you will likely have
an easier time learning to play cello. Similarly, when learning a new language
you take advantage of the languages you already speak. For instance, if your
native language is Norwegian and you decide to learn Dutch, the lexical overlap
between these two languages will likely benefit your rate of language
acquisition. This thesis deals with the intersection of learning multiple tasks
and learning multiple languages in the context of Natural Language Processing
(NLP), which can be defined as the study of computational processing of human
language. Although these two types of learning may seem different on the
surface, we will see that they share many similarities.
The traditional approach in NLP is to consider a single task for a single
language at a time. However, recent advances allow for broadening this
approach, by considering data for multiple tasks and languages simultaneously.
This is an important approach to explore further as the key to improving the
reliability of NLP, especially for low-resource languages, is to take advantage
of all relevant data whenever possible. In doing so, the hope is that in the
long term, low-resource languages can benefit from the advances made in NLP
which are currently to a large extent reserved for high-resource languages.
This, in turn, may then have positive consequences for, e.g., language
preservation, as speakers of minority languages will be under less pressure
to use high-resource languages. In the short term, answering the specific
research questions posed should be of use to NLP researchers working towards
the same goal.
Comment: PhD thesis, University of Groningen
Exploiting Cross-Lingual Representations For Natural Language Processing
Traditional approaches to supervised learning require a generous amount of labeled data for good generalization. While such annotation-heavy approaches have proven useful for some Natural Language Processing (NLP) tasks in high-resource languages (like English), they are unlikely to scale to languages where collecting labeled data is difficult and time-consuming. Translating supervision available in English is also not a viable solution, because developing a good machine translation system requires expensive-to-annotate resources which are not available for most languages.
In this thesis, I argue that cross-lingual representations are an effective means of extending NLP tools to languages beyond English without resorting to generous amounts of annotated data or expensive machine translation. These representations can be learned in an inexpensive manner, often from signals completely unrelated to the task of interest. I begin with a review of different ways of inducing such representations using a variety of cross-lingual signals, and study algorithmic approaches to using them in a diverse set of downstream tasks. Examples of such tasks covered in this thesis include learning representations to transfer a trained model across languages for document classification, assist in monolingual lexical semantics tasks like word sense induction, identify asymmetric lexical relationships like hypernymy between words in different languages, or combine supervision across languages through a shared feature space for cross-lingual entity linking. In all these applications, the representations make information expressed in other languages available in English, while requiring minimal additional supervision in the language of interest.
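The model-transfer setting described here can be sketched with toy vectors: a classifier fit on English document embeddings is applied unchanged to documents from another language embedded in the same shared space. The 2-d vectors and the nearest-centroid rule are illustrative assumptions, not the thesis's models.

```python
import numpy as np

# Sketch of zero-shot model transfer through a shared embedding space.
# Toy 2-d vectors stand in for real cross-lingual document embeddings;
# a nearest-centroid classifier keeps the example short.

def fit_centroids(X, y):
    """One mean vector per class label."""
    return {label: X[y == label].mean(axis=0) for label in set(y)}

def predict(centroids, X):
    """Assign each row of X to the nearest class centroid."""
    labels = sorted(centroids)
    dists = np.stack([np.linalg.norm(X - centroids[l], axis=1) for l in labels])
    return [int(labels[i]) for i in dists.argmin(axis=0)]

# English training documents in the shared space (classes 0 and 1).
X_en = np.array([[0.9, 0.1], [1.0, 0.0], [0.1, 0.9], [0.0, 1.0]])
y_en = np.array([0, 0, 1, 1])
centroids = fit_centroids(X_en, y_en)

# Foreign-language documents embedded into the same space: no foreign
# labels are ever used, which is the point of the transfer setup.
X_fr = np.array([[0.8, 0.2], [0.2, 0.8]])
print(predict(centroids, X_fr))  # [0, 1]
```

The same pattern holds for any classifier: as long as both languages live in one vector space, English supervision carries over without target-language labels.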