223 research outputs found
EPIE Dataset: A Corpus For Possible Idiomatic Expressions
Idiomatic expressions have always been a bottleneck for language
comprehension and natural language understanding, specifically for tasks like
Machine Translation (MT). MT systems predominantly produce literal translations
of idiomatic expressions as they do not exhibit generic and linguistically
deterministic patterns which can be exploited for comprehension of the
non-compositional meaning of the expressions. These expressions occur in
parallel corpora used for training, but due to the comparatively high
occurrences of the constituent words of idiomatic expressions in literal
context, the idiomatic meaning gets overpowered by the compositional meaning of
the expression. State-of-the-art Metaphor Detection Systems are able to detect
non-compositional usage at word level but miss out on idiosyncratic phrasal
idiomatic expressions. This creates a dire need for a dataset with a wider
coverage and higher occurrence of commonly occurring idiomatic expressions, the
spans of which can be used for Metaphor Detection. With this in mind, we
present our English Possible Idiomatic Expressions (EPIE) corpus containing
25206 sentences labelled with lexical instances of 717 idiomatic expressions.
These spans also cover literal usages for the given set of idiomatic
expressions. We also present the utility of our dataset by using it to train a
sequence labelling module and testing on three independent datasets with high
accuracy, precision and recall scores.
PARSEME Survey on MWE Resources
This paper summarizes the first results of an ongoing survey on multiword resources carried out within the IC1207 COST Action PARSEME (PARSing and Multi-word Expressions). Despite the availability of language resource catalogues and the inventory of multiword data-sets available at the SIGLEX-MWE website, multiword resources are scattered and prove difficult to find. In many cases, language resources such as corpora, treebanks or lexical databases include multiwords as part of their data or take them into consideration in their annotations. However, these resources need to be centralized so that other researchers may subsequently use them. The final aim of this survey is thus to create a portal where researchers may find multiword resources or multiword-aware language resources for their research. We report on how the survey was designed and analyze the data gathered so far. We also discuss the problems we have detected upon examination of the data and possible ways of enhancing the survey.
Automatic Extraction Of Malay Compound Nouns Using A Hybrid Of Statistical And Machine Learning Methods
Identifying compound nouns is important for a wide spectrum of applications in the field of natural language processing, such as machine translation and information retrieval. Extraction of compound nouns requires deep or shallow syntactic preprocessing tools and large corpora. This paper investigates several methods for extracting noun compounds from Malay text corpora. First, we present the empirical results of sixteen statistical association measures for Malay <N+N> compound noun extraction. Second, we introduce the possibility of integrating multiple association measures. Third, this work also provides a standard dataset intended as a common platform for evaluating research on the identification of compound nouns in the Malay language. The standard data set contains 7,235 unique N-N candidates, 2,970 of which are N-N compound noun collocations. The extraction algorithms are evaluated against this reference data set. The experimental results demonstrate that a group of association measures (T-test, Piatetsky-Shapiro (PS), C_value, FGM and the rank combination method) are the best association measures and outperform the other association measures for <N+N> collocations in the Malay corpus. Finally, we describe several classification methods for combining the scores of the basic association measures, followed by their evaluation. Evaluation results show that classification algorithms significantly outperform individual association measures. The experimental results obtained are quite satisfactory in terms of Precision, Recall and F-score.
SIGMORPHON 2021 Shared Task on Morphological Reinflection: Generalization Across Languages
This year's iteration of the SIGMORPHON Shared Task on morphological reinflection focuses on typological diversity and cross-lingual variation of morphosyntactic features. In terms of the task, we enrich UniMorph with new data for 32 languages from 13 language families, with most of them being under-resourced: Kunwinjku, Classical Syriac, Arabic (Modern Standard, Egyptian, Gulf), Hebrew, Amharic, Aymara, Magahi, Braj, Kurdish (Central, Northern, Southern), Polish, Karelian, Livvi, Ludic, Veps, Võro, Evenki, Xibe, Tuvan, Sakha, Turkish, Indonesian, Kodi, Seneca, Asháninka, Yanesha, Chukchi, Itelmen, Eibela. We evaluate six systems on the new data and conduct an extensive error analysis of the systems' predictions. Transformer-based models generally demonstrate superior performance on the majority of languages, achieving >90% accuracy on 65% of them. The languages on which systems yielded low accuracy are mainly under-resourced, with a limited amount of data. Most errors made by the systems are due to allomorphy, honorificity, and form variation. In addition, we observe that systems especially struggle to inflect multiword lemmas. The systems also produce misspelled forms or end up in repetitive loops (e.g., RNN-based models). Finally, we report a large drop in systems' performance on previously unseen lemmas.
Multiword expressions
Multiword expressions (MWEs) are a challenge for both natural language applications and linguistic theory because they often defy the application of the machinery developed for free combinations, where the default is that the meaning of an utterance can be predicted from its structure. There is a rich body of primarily descriptive work on MWEs for many European languages, but there is little comparative work. The volume brings together MWE experts to explore the benefits of a multilingual perspective on MWEs. The ten contributions in this volume look at MWEs in Bulgarian, English, French, German, Maori, Modern Greek, Romanian, Serbian, and Spanish. They discuss prominent issues in MWE research such as the classification of MWEs, their formal grammatical modeling, and the description of individual MWE types from the point of view of different theoretical frameworks, such as Dependency Grammar, Generative Grammar, Head-driven Phrase Structure Grammar, Lexical Functional Grammar, and Lexicon Grammar.