Search CORE

223 research outputs found

EPIE Dataset: A Corpus For Possible Idiomatic Expressions

Author: Paul Soma
Saxena Prateek
Publication venue
Publication date: 16/06/2020
Field of study

Idiomatic expressions have always been a bottleneck for language comprehension and natural language understanding, specifically for tasks like Machine Translation(MT). MT systems predominantly produce literal translations of idiomatic expressions as they do not exhibit generic and linguistically deterministic patterns which can be exploited for comprehension of the non-compositional meaning of the expressions. These expressions occur in parallel corpora used for training, but due to the comparatively high occurrences of the constituent words of idiomatic expressions in literal context, the idiomatic meaning gets overpowered by the compositional meaning of the expression. State of the art Metaphor Detection Systems are able to detect non-compositional usage at word level but miss out on idiosyncratic phrasal idiomatic expressions. This creates a dire need for a dataset with a wider coverage and higher occurrence of commonly occurring idiomatic expressions, the spans of which can be used for Metaphor Detection. With this in mind, we present our English Possible Idiomatic Expressions(EPIE) corpus containing 25206 sentences labelled with lexical instances of 717 idiomatic expressions. These spans also cover literal usages for the given set of idiomatic expressions. We also present the utility of our dataset by using it to train a sequence labelling module and testing on three independent datasets with high accuracy, precision and recall scores

arXiv.org e-Print Archive

Crossref

PARSEME Survey on MWE Resources

Author: Bargmann S.
Losnegaard G. S.
Monti Johanna
Parra Escartın C.
Sangati F.
Savary A.
Publication venue
Publication date: 01/01/2016
Field of study

International audienceThis paper summarizes the first results of an ongoing survey on multiword resources carried out within the IC1207 Cost ActionPARSEME (PARSing and Multi-word Expressions). Despite the availability of language resource catalogues and the inventory ofmultiword data-sets available at the SIGLEX-MWE website, multiword resources are scattered and prove to be difficult to be found.In many cases, language resources such as corpora, treebanks or lexical databases include multiwords as part of their data or take theminto consideration in their annotations. However, it is needed to centralize these resources so that other researches may subsequentlyuse them. The final aim of this survey is thus to create a portal where researchers may find multiword resources or multiword-awarelanguage resources for their research. We report on how the survey was designed and analyze the data gathered so far. We also discussthe problems we have detected upon examination of the data and possible ways of enhancing the survey

Università degli Studi di Napoli L'Orientale: CINECA IRIS

HAL Université de Tours

Automatic Extraction Of Malay Compound Nouns Using A Hybrid Of Statistical And Machine Learning Methods

Author: A. S. Hazaa Muneer
Albared Mohammed
Ba-Alwi Fadl Mutaher
Omar Nazlia
Publication venue: 'Institute of Advanced Engineering and Science'
Publication date: 01/06/2016
Field of study

Identifying of compound nouns is important for a wide spectrum of applications in the field of natural language processing such as machine translation and information retrieval. Extraction of compound nouns requires deep or shallow syntactic preprocessing tools and large corpora. This paper investigates several methods for extracting Noun compounds from Malay text corpora. First, we present the empirical results of sixteen statistical association measures of Malay <N+N> compound nouns extraction. Second, we introduce the possibility of integrating multiple association measures. Third, this work also provides a standard dataset intended to provide a common platform for evaluating research on the identification compound Nouns in Malay language. The standard data set contains 7,235 unique N-N candidates, 2,970 of them are N-N compound nouns collocations. The extraction algorithms are evaluated against this reference data set. The experimental results demonstrate that a group of association measures (T-test , Piatersky-Shapiro (PS) , C_value, FGM and rank combination method) are the best association measure and outperforms the other association measures for <N+N> collocations in the Malay corpus. Finally, we describe several classification methods for combining association measures scores of the basic measures, followed by their evaluation. Evaluation results show that classification algorithms significantly outperform individual association measures. Experimental results obtained are quite satisfactory in terms of the Precision, Recall and F-score

IAES journal

Crossref

Institute of Advanced Engineering and Science

SIGMORPHON 2021 Shared Task on Morphological Reinflection: Generalization Across Languages

Author: Aiton Grant
Ambridge Ben
Ataman Duygu
Ate Yustinus Ghanggo
Barta Botond
Bayyr-ool Aziyana
Bernardy Jean-Philippe
Chodroff Eleanor
Coler Matt
Cotterell Ryan
Ek Adam
El-Khaissi Charbel
Ganieva Sofya
Gasser Michael
Goldman Omer
Habash Nizar
Hatcher Richard J.
Hulden Mans
Ivanova Sardana
Khalifa Salam
Kieraś Witold
Klyachko Elena
Krizhanovsky Andrew
Krizhanovsky Natalia
Kumar Ritesh
Lakatos Dorina
Lane William
Leonard Brian
Liu Zoey
Mielke Sabrina J.
Montoya Samame Jaime Rafael
Nicolai Garett
Nuriah Zahroh
Oncevay Arturo
Pimentel Tiago
Plugaryov Matvey
Ponti Edoardo M.
Prud'hommeaux Emily
Raj Mohit
Ratan Shyam
Ryskina Maria
Salchak Aelita
Salehi Ali
Shcherbakov Andrey
Sheifer Karina
Silva Villegas Gema Celeste
Stoehr Niklas
Straughn Christopher
Suhardijanto Totok
Szolnok Gábor
Tyers Francis M.
Vania Clara
Vylomova Ekaterina
Washington Jonathan
Woliński Marcin
Wu Shijie
Yarowsky David
Ács Judit
Publication venue: The Association for Computational Linguistics
Publication date: 01/08/2021
Field of study

This year's iteration of the SIGMORPHON Shared Task on morphological reinflection focuses on typological diversity and cross-lingual variation of morphosyntactic features. In terms of the task, we enrich UniMorph with new data for 32 languages from 13 language families, with most of them being under-resourced: Kunwinjku, Classical Syriac, Arabic (Modern Standard, Egyptian, Gulf), Hebrew, Amharic, Aymara, Magahi, Braj, Kurdish (Central, Northern, Southern), Polish, Karelian, Livvi, Ludic, Veps, Võro, Evenki, Xibe, Tuvan, Sakha, Turkish, Indonesian, Kodi, Seneca, Asháninka, Yanesha, Chukchi, Itelmen, Eibela. We evaluate six systems on the new data and conduct an extensive error analysis of the systems' predictions. Transformer-based models generally demonstrate superior performance on the majority of languages, achieving >90% accuracy on 65% of them. The languages on which systems yielded low accuracy are mainly under-resourced, with a limited amount of data. Most errors made by the systems are due to allomorphy, honorificity, and form variation. In addition, we observe that systems especially struggle to inflect multiword lemmas. The systems also produce misspelled forms or end up in repetitive loops (e.g., RNN-based models). Finally, we report a large drop in systems' performance on previously unseen lemmas.Peer reviewe

Edinburgh Research Explorer

Helsingin yliopiston digitaalinen arkisto

SIGMORPHON 2021 Shared Task on Morphological Reinflection: Generalization Across Languages

Author: Pimentel Tiago
Ryskina Maria
Straughn Christopher
Publication venue: NEIU Digital Commons
Publication date: 01/01/2021
Field of study

This year’s iteration of the SIGMORPHON Shared Task on morphological reinflection focuses on typological diversity and cross-lingual variation of morphosyntactic features. In terms of the task, we enrich UniMorph with new data for 32 languages from 13 language families, with most of them being under-resourced: Kunwinjku, Classical Syriac, Arabic (Modern Standard, Egyptian, Gulf), Hebrew, Amharic, Aymara, Magahi, Braj, Kurdish (Central, Northern, Southern), Polish, Karelian, Livvi, Ludic, Veps, Võro, Evenki, Xibe, Tuvan, Sakha, Turkish, Indonesian, Kodi, Seneca, Asháninka, Yanesha, Chukchi, Itelmen, Eibela. We evaluate six systems on the new data and conduct an extensive error analysis of the systems’ predictions. Transformer-based models generally demonstrate superior performance on the majority of languages, achieving \u3e90% accuracy on 65% of them. The languages on which systems yielded low accuracy are mainly under-resourced, with a limited amount of data. Most errors made by the systems are due to allomorphy, honorificity, and form variation. In addition, we observe that systems especially struggle to inflect multiword lemmas. The systems also produce misspelled forms or end up in repetitive loops (e.g., RNN-based models). Finally, we report a large drop in systems’ performance on previously unseen lemmas

NEIU Digital Commons (Northeastern Illinois University)

Multiword expressions

Author
Publication venue
Publication date
Field of study

Multiword expressions (MWEs) are a challenge for both the natural language applications and the linguistic theory because they often defy the application of the machinery developed for free combinations where the default is that the meaning of an utterance can be predicted from its structure. There is a rich body of primarily descriptive work on MWEs for many European languages but comparative work is little. The volume brings together MWE experts to explore the benefits of a multilingual perspective on MWEs. The ten contributions in this volume look at MWEs in Bulgarian, English, French, German, Maori, Modern Greek, Romanian, Serbian, and Spanish. They discuss prominent issues in MWE research such as classification of MWEs, their formal grammatical modeling, and the description of individual MWE types from the point of view of different theoretical frameworks, such as Dependency Grammar, Generative Grammar, Head-driven Phrase Structure Grammar, Lexical Functional Grammar, Lexicon Grammar

OAPEN Library