774 research outputs found
An Empirical Study of the Impact of Idioms on Phrase Based Statistical Machine Translation of English to Brazilian-Portuguese
This paper describes an experiment to evaluate the impact of idioms on Statis- tical Machine Translation (SMT) process using the language pair English/Brazilian- Portuguese. Our results show that on sen- tences containing idioms a standard SMT system achieves about half the BLEU score of the same system when applied to sentences that do not contain idioms. We also provide a short error analysis and out- line our planned work to overcome this limitation
Examining the Tip of the Iceberg: A Data Set for Idiom Translation
Neural Machine Translation (NMT) has been widely used in recent years with
significant improvements for many language pairs. Although state-of-the-art NMT
systems are generating progressively better translations, idiom translation
remains one of the open challenges in this field. Idioms, a category of
multiword expressions, are an interesting language phenomenon where the overall
meaning of the expression cannot be composed from the meanings of its parts. A
first important challenge is the lack of dedicated data sets for learning and
evaluating idiom translation. In this paper we address this problem by creating
the first large-scale data set for idiom translation. Our data set is
automatically extracted from a widely used German-English translation corpus
and includes, for each language direction, a targeted evaluation set where all
sentences contain idioms and a regular training corpus where sentences
including idioms are marked. We release this data set and use it to perform
preliminary NMT experiments as the first step towards better idiom translation.Comment: Accepted at LREC 201
Representations of Idioms for Natural Language Processing: Idiom type and token identification, Language Modelling and Neural Machine Translation
An idiom is a multiword expression (MWE) whose meaning is non- compositional, i.e., the meaning of the expression is different from the meaning of its individual components. Idioms are complex construc- tions of language used creatively across almost all text genres. Idioms pose problems to natural language processing (NLP) systems due to their non-compositional nature, and the correct processing of idioms can improve a wide range of NLP systems. Current approaches to idiom processing vary in terms of the amount of discourse history required to extract the features necessary to build representations for the expressions. These features are, in general, stat- istics extracted from the text and often fail to capture all the nuances involved in idiom usage.
We argue in this thesis that a more flexible representations must be used to process idioms in a range of idiom related tasks. We demonstrate that high-dimensional representations allow idiom classifiers to better model the interactions between global and local features and thereby improve the performance of these systems with regard to processing idioms. In support of this thesis we demonstrate that distributed representations of sentences, such as those generated by a Recurrent Neural Network (RNN) greatly reduce the amount of discourse history required to process idioms and that by using those representations a “general” classifier, that can take any expression as input and classify it as either an idiomatic or literal usage, is feasible. We also propose and evaluate a novel technique to add an attention module to a language model in order to bring forward past information in a RNN-based Language Model (RNN-LM). The results of our evaluation experiments demonstrate that this attention module increases the performance of such models in terms of the perplexity achieved when processing idioms. Our analysis also shows that it improves the performance of RNN-LMs on literal language and, at the same time, helps to bridge long-distance dependencies and reduce the number of parameters required in RNN-LMs to achieve state-of-the-art performance. We investigate the adaptation of this novel RNN-LM to Neural Machine Translation (NMT) systems and we show that, despite the mixed results, it improves the translation of idioms into languages that require distant reordering such as German. We also show that these models are suited to small corpora for in-domain translations for language pairs such as English/Brazilian-Portuguese
EPIE Dataset: A Corpus For Possible Idiomatic Expressions
Idiomatic expressions have always been a bottleneck for language
comprehension and natural language understanding, specifically for tasks like
Machine Translation(MT). MT systems predominantly produce literal translations
of idiomatic expressions as they do not exhibit generic and linguistically
deterministic patterns which can be exploited for comprehension of the
non-compositional meaning of the expressions. These expressions occur in
parallel corpora used for training, but due to the comparatively high
occurrences of the constituent words of idiomatic expressions in literal
context, the idiomatic meaning gets overpowered by the compositional meaning of
the expression. State of the art Metaphor Detection Systems are able to detect
non-compositional usage at word level but miss out on idiosyncratic phrasal
idiomatic expressions. This creates a dire need for a dataset with a wider
coverage and higher occurrence of commonly occurring idiomatic expressions, the
spans of which can be used for Metaphor Detection. With this in mind, we
present our English Possible Idiomatic Expressions(EPIE) corpus containing
25206 sentences labelled with lexical instances of 717 idiomatic expressions.
These spans also cover literal usages for the given set of idiomatic
expressions. We also present the utility of our dataset by using it to train a
sequence labelling module and testing on three independent datasets with high
accuracy, precision and recall scores
Un environnement générique et ouvert pour le traitement des expressions polylexicales
The treatment of multiword expressions (MWEs), like take off, bus stop and big deal, is a challenge for NLP applications. This kind of linguistic construction is not only arbitrary but also much more frequent than one would initially guess. This thesis investigates the behaviour of MWEs across different languages, domains and construction types, proposing and evaluating an integrated methodological framework for their acquisition. There have been many theoretical proposals to define, characterise and classify MWEs. We adopt generic definition stating that MWEs are word combinations which must be treated as a unit at some level of linguistic processing. They present a variable degree of institutionalisation, arbitrariness, heterogeneity and limited syntactic and semantic variability. There has been much research on automatic MWE acquisition in the recent decades, and the state of the art covers a large number of techniques and languages. Other tasks involving MWEs, namely disambiguation, interpretation, representation and applications, have received less emphasis in the field. The first main contribution of this thesis is the proposal of an original methodological framework for automatic MWE acquisition from monolingual corpora. This framework is generic, language independent, integrated and contains a freely available implementation, the mwetoolkit. It is composed of independent modules which may themselves use multiple techniques to solve a specific sub-task in MWE acquisition. The evaluation of MWE acquisition is modelled using four independent axes. We underline that the evaluation results depend on parameters of the acquisition context, e.g., nature and size of corpora, language and type of MWE, analysis depth, and existing resources. The second main contribution of this thesis is the application-oriented evaluation of our methodology proposal in two applications: computer-assisted lexicography and statistical machine translation. For the former, we evaluate the usefulness of automatic MWE acquisition with the mwetoolkit for creating three lexicons: Greek nominal expressions, Portuguese complex predicates and Portuguese sentiment expressions. For the latter, we test several integration strategies in order to improve the treatment given to English phrasal verbs when translated by a standard statistical MT system into Portuguese. Both applications can benefit from automatic MWE acquisition, as the expressions acquired automatically from corpora can both speed up and improve the quality of the results. The promising results of previous and ongoing experiments encourage further investigation about the optimal way to integrate MWE treatment into other applications. Thus, we conclude the thesis with an overview of the past, ongoing and future work
Multiword expressions at length and in depth
The annual workshop on multiword expressions takes place since 2001 in conjunction with major computational linguistics conferences and attracts the attention of an ever-growing community working on a variety of languages, linguistic phenomena and related computational processing issues. MWE 2017 took place in Valencia, Spain, and represented a vibrant panorama of the current research landscape on the computational treatment of multiword expressions, featuring many high-quality submissions. Furthermore, MWE 2017 included the first shared task on multilingual identification of verbal multiword expressions. The shared task, with extended communal work, has developed important multilingual resources and mobilised several research groups in computational linguistics worldwide. This book contains extended versions of selected papers from the workshop. Authors worked hard to include detailed explanations, broader and deeper analyses, and new exciting results, which were thoroughly reviewed by an internationally renowned committee. We hope that this distinctly joint effort will provide a meaningful and useful snapshot of the multilingual state of the art in multiword expressions modelling and processing, and will be a point point of reference for future work
- …