A Universal Part-of-Speech Tagset
To facilitate future research in unsupervised induction of syntactic
structure and to standardize best practices, we propose a tagset that consists
of twelve universal part-of-speech categories. In addition to the tagset, we
develop a mapping from 25 different treebank tagsets to this universal set. As
a result, when combined with the original treebank data, this universal tagset
and mapping produce a dataset consisting of common parts-of-speech for 22
different languages. We highlight the use of this resource via two experiments,
including one that reports competitive accuracies for unsupervised grammar
induction without gold standard part-of-speech tags.
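The mapping itself is conceptually a lookup table from each treebank's tags to the twelve universal categories (NOUN, VERB, ADJ, ADV, PRON, DET, ADP, NUM, CONJ, PRT, '.', X). A minimal sketch, assuming a small illustrative subset of Penn Treebank tags rather than the full published tables:

```python
# Sketch: mapping a language-specific treebank tagset to the twelve
# universal categories. The Penn Treebank entries below are a small
# illustrative subset, not the paper's full published mapping.
PTB_TO_UNIVERSAL = {
    "NN": "NOUN", "NNS": "NOUN", "NNP": "NOUN",
    "VB": "VERB", "VBD": "VERB", "VBZ": "VERB",
    "JJ": "ADJ", "RB": "ADV", "PRP": "PRON",
    "DT": "DET", "IN": "ADP", "CD": "NUM",
    "CC": "CONJ", "RP": "PRT", ".": ".",
}

def to_universal(tagged_sentence):
    """Map (word, treebank_tag) pairs to universal tags.

    Tags missing from the table fall back to the catch-all category X.
    """
    return [(w, PTB_TO_UNIVERSAL.get(t, "X")) for w, t in tagged_sentence]

sent = [("The", "DT"), ("cats", "NNS"), ("sleep", "VBP"), (".", ".")]
# "VBP" is deliberately absent from the toy table, so it maps to X here.
print(to_universal(sent))
```

With the full mapping in place, 25 treebank tagsets collapse onto one shared inventory, which is what makes cross-language comparison possible.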
Beto, Bentz, Becas: The Surprising Cross-Lingual Effectiveness of BERT
Pretrained contextual representation models (Peters et al., 2018; Devlin et
al., 2018) have pushed forward the state-of-the-art on many NLP tasks. A new
release of BERT (Devlin, 2018) includes a model simultaneously pretrained on
104 languages with impressive performance for zero-shot cross-lingual transfer
on a natural language inference task. This paper explores the broader
cross-lingual potential of mBERT (multilingual BERT) as a zero-shot language
transfer model on 5 NLP tasks covering a total of 39 languages from various
language families: NLI, document classification, NER, POS tagging, and
dependency parsing. We compare mBERT with the best-published methods for
zero-shot cross-lingual transfer and find mBERT competitive on each task.
Additionally, we investigate the most effective strategy for utilizing mBERT in
this manner, determine to what extent mBERT generalizes away from
language-specific features, and measure factors that influence cross-lingual transfer.
Comment: EMNLP 2019 Camera Ready
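Zero-shot transfer in this sense means fitting a model on labeled data in one language (typically English) and applying it unchanged to others, relying on the shared multilingual encoder. The sketch below shows only that protocol; the `embed` function is a hypothetical stub standing in for mBERT's contextual representations, not a real encoder.

```python
# Sketch of the zero-shot cross-lingual evaluation protocol: fit a
# classifier on English examples only, then test directly on another
# language. `embed` is a hypothetical stand-in for a shared
# multilingual encoder such as mBERT.

def embed(sentence):
    # Toy "language-neutral" feature: token count. A real multilingual
    # encoder would return a dense contextual vector instead.
    return len(sentence.split())

def train_threshold(examples):
    """Fit a trivial threshold 'classifier' on (sentence, label) pairs."""
    pos = [embed(s) for s, y in examples if y == 1]
    neg = [embed(s) for s, y in examples if y == 0]
    return (min(pos) + max(neg)) / 2  # boundary between the two classes

def predict(threshold, sentence):
    return 1 if embed(sentence) > threshold else 0

english_train = [
    ("a short sentence", 0),
    ("this one is a much longer sentence overall", 1),
]
t = train_threshold(english_train)

# Zero-shot step: the English-trained boundary is applied to Spanish
# input with no Spanish labels and no translation data.
print(predict(t, "una frase corta"))
print(predict(t, "esta es una frase mucho mas larga en general"))
```

The point of the toy is structural: nothing language-specific is refit at test time, which is exactly the regime the five tasks above probe.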
Statistical parsing of morphologically rich languages (SPMRL): what, how and whither
The term Morphologically Rich Languages (MRLs) refers to languages in which significant information concerning syntactic units and relations is expressed at word level. There is ample evidence that the application of readily available statistical parsing models to such languages is susceptible to serious performance degradation. The first workshop on statistical parsing of MRLs hosted a variety of contributions which show that, despite language-specific idiosyncrasies, the problems associated with parsing MRLs cut across languages and parsing frameworks. In this paper we review the current state of affairs with respect to parsing MRLs and point out central challenges. We synthesize the contributions of researchers working on parsing Arabic, Basque, French, German, Hebrew, Hindi and Korean to point out shared solutions across languages. The overarching analysis suggests itself as a source of directions for future investigations.
Zero-shot Dependency Parsing with Pre-trained Multilingual Sentence Representations
We investigate whether off-the-shelf deep bidirectional sentence
representations trained on a massively multilingual corpus (multilingual BERT)
enable the development of an unsupervised universal dependency parser. This
approach only leverages a mix of monolingual corpora in many languages and does
not require any translation data, making it applicable to low-resource
languages. In our experiments we outperform the best CoNLL 2018
language-specific systems in all of the shared task's six truly low-resource
languages while using a single system. However, we also find that (i) parsing
accuracy still varies dramatically when changing the training languages and
(ii) in some target languages zero-shot transfer fails under all tested
conditions, raising concerns about the 'universality' of the whole approach.
Comment: DeepLo workshop, EMNLP 2019
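Parsing accuracy in experiments like these is conventionally reported as attachment score: the fraction of tokens whose predicted head matches the gold head (UAS; LAS additionally requires the correct dependency label). A minimal UAS computation:

```python
def uas(gold_heads, pred_heads):
    """Unlabeled attachment score: fraction of tokens whose predicted
    head index matches the gold head index (0 denotes the root)."""
    assert len(gold_heads) == len(pred_heads)
    correct = sum(g == p for g, p in zip(gold_heads, pred_heads))
    return correct / len(gold_heads)

# Toy 4-token sentence: each entry is the index of the token's governor.
gold = [2, 0, 2, 3]
pred = [2, 0, 2, 2]  # one head is wrong
print(uas(gold, pred))  # 0.75
```

Because the score is per token, a single system can be compared across target languages directly, which is how the variation reported in (i) and (ii) above is measured.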
Unlocking Bias Detection: Leveraging Transformer-Based Models for Content Analysis
Bias detection in text is imperative due to its role in reinforcing negative
stereotypes, disseminating misinformation, and influencing decisions. Current
language models often fall short in generalizing beyond their training sets. In
response, we introduce the Contextualized Bi-Directional Dual Transformer
(CBDT) Classifier. This novel architecture utilizes two synergistic transformer
networks: the Context Transformer and the Entity Transformer, aiming for
enhanced bias detection. Our dataset preparation follows the FAIR principles,
ensuring ethical data usage. Through rigorous testing on various datasets, CBDT
demonstrates its ability to distinguish biased from neutral statements while
also pinpointing exact biased lexemes. Our approach outperforms existing
methods, achieving a 2-4% increase over benchmark performances. This opens
avenues for adapting the CBDT model across diverse linguistic and cultural
landscapes.
Comment: UNDER REVIEW
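The paper's implementation details are not public, but the general idea of fusing two encoder views (whole-context vs. entity-focused) into one classifier can be sketched as follows. Every name and value here is an illustrative assumption, not the actual CBDT architecture.

```python
# Sketch of fusing two encoder views for binary bias classification,
# in the spirit of a dual-transformer design: one vector summarizing
# the full context, one summarizing the entity mentions. Both encoders
# are stubbed with fixed feature vectors; all names are hypothetical.
import math

def classify(context_vec, entity_vec, weights, bias):
    """Concatenate the two views and apply a logistic output layer."""
    fused = context_vec + entity_vec  # list concatenation = feature fusion
    score = sum(w * x for w, x in zip(weights, fused)) + bias
    return 1 / (1 + math.exp(-score))  # probability the text is biased

context_vec = [0.2, -0.1]  # stub "Context Transformer" output
entity_vec = [0.7, 0.4]    # stub "Entity Transformer" output
weights = [1.0, 0.5, 2.0, -1.0]
prob = classify(context_vec, entity_vec, weights, bias=0.0)
print(round(prob, 3))
```

The fusion step is the design choice the name "dual transformer" points at: the context view captures how something is said, the entity view who it is said about, and the classifier weighs both.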