Mixed up with Machine Translation: Multi-word Units Disambiguation Challenge
With the rapid evolution of the Internet, translation has become part of the daily life of ordinary users, not only of professional translators. Machine translation has evolved along with different types of computer-assisted translation tools. Qualitative progress has been made in the field of machine translation, but not all problems have been solved. The time is right for developing more sophisticated evaluation tools that measure performance on specific linguistic phenomena. One problem in particular, the poor analysis and translation of multi-word units, is an area where investment in linguistic knowledge systems aimed at improving machine translation would be beneficial. This paper addresses the difficulties multi-word units present to machine translation by comparing translations produced by systems that adopt different approaches to machine translation. It proposes a solution for improving the quality of the translation of multi-word units by adopting a methodology that combines Lexicon Grammar resources with OpenLogos lexical resources and semantico-syntactic rules. Finally, it discusses what an ideal machine translation evaluation tool might look like in order to correctly evaluate the performance of machine translation engines with regard to multi-word units and thus contribute to the improvement of translation quality.
Linguistic evaluation of support verb constructions by OpenLogos and Google Translate
This paper presents a systematic human evaluation of translations of English support verb constructions (SVCs) produced by a rule-based machine translation (RBMT) system (OpenLogos) and a statistical machine translation (SMT) system (Google Translate) for five languages: French, German, Italian, Portuguese and Spanish. We classify support verb constructions by their syntactic structure and semantic behavior and present a qualitative analysis of their translation errors. The study aims to verify how machine translation (MT) systems translate fine-grained linguistic phenomena, and how well equipped they are to produce high-quality translation. Another goal of the linguistically motivated quality analysis of raw SVC output is to reinforce the need for better system hybridization, which leverages the strengths of RBMT to the benefit of SMT, especially in improving the translation of multiword units. Taking multiword units into account, we propose an effective method to achieve MT hybridization based on the integration of semantico-syntactic knowledge into SMT.
D6.1: Technologies and Tools for Lexical Acquisition
This report describes the technologies and tools to be used for Lexical Acquisition in PANACEA. It includes descriptions of existing technologies and tools which can be built on and improved within PANACEA, as well as of new technologies and tools to be developed and integrated in the PANACEA platform. The report also specifies the Lexical Resources to be produced. Four main areas of lexical acquisition are included: Subcategorization Frames (SCFs), Selectional Preferences (SPs), Lexical-semantic Classes (LCs), for both nouns and verbs, and Multi-Word Expressions (MWEs).
Contractions: to align or not to align, that is the question
This paper presents a detailed analysis of the alignment of Portuguese contractions, based on a previously aligned bilingual corpus. The alignment task was performed manually on a subset of the English-Portuguese CLUE4Translation Alignment Collection. The initial parallel corpus was pre-processed and a decision was made as to whether each contraction should be maintained or decomposed in the alignment. Decomposition was required in cases where the two words that have been concatenated, i.e., the preposition and the determiner or pronoun, belong to two separate translation alignment pairs (PT - [no seio de] [a União Europeia] EN - [within] [the European Union]). Most contractions required decomposition in contexts where they are positioned at the end of a multiword unit. On the other hand, contractions tend to be maintained when they occur at the beginning or in the middle of the multiword unit, i.e., in the frozen part of the multiword (PT - [no que diz respeito a] EN - [with regard to] or PT - [além disso] EN - [in addition]). A correct alignment of multiwords and phrasal units containing contractions is instrumental for machine translation, paraphrasing, and variety adaptation.
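The decomposition rule described in this abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `CONTRACTIONS` table is a small hand-picked subset of Portuguese preposition + determiner contractions, and `split_at_unit_boundary` is a hypothetical helper name, not something taken from the paper or from the CLUE4Translation collection.

```python
# Illustrative subset of Portuguese contractions (assumption: not the paper's resource).
CONTRACTIONS = {
    "do": ("de", "o"),
    "da": ("de", "a"),
    "no": ("em", "o"),
    "na": ("em", "a"),
    "ao": ("a", "o"),
    "pelo": ("por", "o"),
    "pela": ("por", "a"),
}


def split_at_unit_boundary(unit_tokens, following_tokens):
    """Decompose a contraction only when it is the last token of a multiword unit.

    The preposition stays in the unit; the determiner moves to the start of the
    following phrase. Contractions at the beginning or in the middle of the unit
    are left untouched.
    """
    if unit_tokens and unit_tokens[-1].lower() in CONTRACTIONS:
        preposition, determiner = CONTRACTIONS[unit_tokens[-1].lower()]
        return unit_tokens[:-1] + [preposition], [determiner] + following_tokens
    return list(unit_tokens), list(following_tokens)


# "no seio da União Europeia": the final "da" is split, the initial "no" is kept.
unit, rest = split_at_unit_boundary(["no", "seio", "da"], ["União", "Europeia"])
print(unit, rest)
```

Only the unit-final token is checked, which mirrors the tendency reported in the abstract; a real aligner would still need a human or a richer model to decide the remaining cases.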
An English-Italian MWE dictionary
The translation of Multiword Expressions (MWEs) requires knowledge of the correct equivalent in the target language, which is hardly ever the result of a literal translation. This paper is based on the assumption that the proper treatment of MWEs in Natural Language Processing (NLP) applications, in particular in Machine Translation and in translation technologies more generally, calls for a computational approach which must be, at least partially, knowledge-based, and in particular should be grounded on an explicit linguistic description of MWEs, using both an electronic dictionary and a set of rules. The hypothesis is that a linguistic approach can complement probabilistic methodologies to help identify and translate MWEs correctly, since hand-crafted and linguistically motivated resources, in the form of electronic dictionaries and local grammars, obtain accurate and reliable results for NLP purposes. The methodology adopted for this research work is based on (i) Nooj, an NLP environment which allows the development and testing of linguistic resources, (ii) an electronic English-Italian MWE dictionary, and (iii) a set of local grammars. The dictionary mainly consists of English phrasal verbs, support verb constructions, idiomatic expressions and collocations together with their translation in Italian, and contains different types of MWE POS patterns.
Multiword expressions at length and in depth
The annual workshop on multiword expressions has taken place since 2001 in conjunction with major computational linguistics conferences and attracts the attention of an ever-growing community working on a variety of languages, linguistic phenomena and related computational processing issues. MWE 2017 took place in Valencia, Spain, and represented a vibrant panorama of the current research landscape on the computational treatment of multiword expressions, featuring many high-quality submissions. Furthermore, MWE 2017 included the first shared task on multilingual identification of verbal multiword expressions. The shared task, with extended communal work, has developed important multilingual resources and mobilised several research groups in computational linguistics worldwide. This book contains extended versions of selected papers from the workshop. Authors worked hard to include detailed explanations, broader and deeper analyses, and new exciting results, which were thoroughly reviewed by an internationally renowned committee. We hope that this distinctly joint effort will provide a meaningful and useful snapshot of the multilingual state of the art in multiword expressions modelling and processing, and will be a point of reference for future work.
Multiword expression processing: A survey
Multiword expressions (MWEs) are a class of linguistic forms spanning conventional word boundaries that are both idiosyncratic and pervasive across different languages. The structure of linguistic processing that depends on the clear distinction between words and phrases has to be re-thought to accommodate MWEs. The issue of MWE handling is crucial for NLP applications, where it raises a number of challenges. The emergence of solutions in the absence of guiding principles motivates this survey, whose aim is not only to provide a focused review of MWE processing, but also to clarify the nature of interactions between MWE processing and downstream applications. We propose a conceptual framework within which challenges and research contributions can be positioned. It offers a shared understanding of what is meant by "MWE processing," distinguishing the subtasks of MWE discovery and identification. It also elucidates the interactions between MWE processing and two use cases: parsing and machine translation. Many of the approaches in the literature can be differentiated according to how MWE processing is timed with respect to underlying use cases. We discuss how such orchestration choices affect the scope of MWE-aware systems. For each of the two MWE processing subtasks and for each of the two use cases, we conclude with open issues and research perspectives.
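The distinction between the two subtasks named in this abstract can be made concrete with a minimal Python sketch. It rests on assumptions that are not from the survey: discovery is approximated here by ranking bigram types with pointwise mutual information, identification by exact lookup of adjacent token pairs in a lexicon, and the function names `discover` and `identify` are hypothetical.

```python
import math
from collections import Counter


def discover(sentences, min_count=2):
    """Discovery: propose candidate MWE *types* from a corpus.

    Ranks bigrams seen at least `min_count` times by pointwise mutual
    information (one common association measure, chosen for illustration).
    """
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        unigrams.update(sentence)
        bigrams.update(zip(sentence, sentence[1:]))
    total_unigrams = sum(unigrams.values())
    total_bigrams = sum(bigrams.values())
    scores = {}
    for (first, second), count in bigrams.items():
        if count < min_count:
            continue
        p_pair = count / total_bigrams
        p_first = unigrams[first] / total_unigrams
        p_second = unigrams[second] / total_unigrams
        scores[(first, second)] = math.log2(p_pair / (p_first * p_second))
    return sorted(scores, key=scores.get, reverse=True)


def identify(sentence, lexicon):
    """Identification: mark the token spans in running text that match known MWEs."""
    return [
        (i, i + 2)
        for i in range(len(sentence) - 1)
        if (sentence[i], sentence[i + 1]) in lexicon
    ]


corpus = [
    ["he", "kicked", "the", "bucket"],
    ["she", "kicked", "the", "bucket"],
    ["the", "dog", "ran"],
]
candidates = discover(corpus)
print(candidates)
print(identify(["he", "kicked", "the", "bucket"], set(candidates)))
```

Discovery outputs lexicon entries, while identification outputs positions in a specific sentence; the output of the first can feed the second, which is one way the two subtasks may be chained before a parser or translation system.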
Taking on new challenges in multi-word unit processing for Machine Translation
This paper discusses the qualitative comparative evaluation performed on the results of two machine translation systems with different approaches to the processing of multi-word units. It proposes a solution for overcoming the difficulties multi-word units present to machine translation by adopting a methodology that combines the lexicon grammar approach with the OpenLogos ontology and semantico-syntactic rules. The paper also discusses the importance of qualitative evaluation metrics for correctly evaluating the performance of machine translation engines with regard to multi-word units.
Designing Statistical Language Learners: Experiments on Noun Compounds
The goal of this thesis is to advance the exploration of the statistical
language learning design space. In pursuit of that goal, the thesis makes two
main theoretical contributions: (i) it identifies a new class of designs by
specifying an architecture for natural language analysis in which probabilities
are given to semantic forms rather than to more superficial linguistic
elements; and (ii) it explores the development of a mathematical theory to
predict the expected accuracy of statistical language learning systems in terms
of the volume of data used to train them.
The theoretical work is illustrated by applying statistical language learning
designs to the analysis of noun compounds. Both syntactic and semantic analysis
of noun compounds are attempted using the proposed architecture. Empirical
comparisons demonstrate that the proposed syntactic model is significantly
better than those previously suggested, approaching the performance of human
judges on the same task, and that the proposed semantic model, the first
statistical approach to this problem, exhibits significantly better accuracy
than the baseline strategy. These results suggest that the new class of designs
identified is a promising one. The experiments also serve to highlight the need
for a widely applicable theory of data requirements.
Comment: PhD thesis (Macquarie University, Sydney; December 1995), LaTeX source, xii+214 pages