Multiword expressions at length and in depth
The annual workshop on multiword expressions has taken place since 2001 in conjunction with major computational linguistics conferences and attracts the attention of an ever-growing community working on a variety of languages, linguistic phenomena and related computational processing issues. MWE 2017 took place in Valencia, Spain, and represented a vibrant panorama of the current research landscape on the computational treatment of multiword expressions, featuring many high-quality submissions. Furthermore, MWE 2017 included the first shared task on multilingual identification of verbal multiword expressions. The shared task, with extended communal work, has developed important multilingual resources and mobilised several research groups in computational linguistics worldwide. This book contains extended versions of selected papers from the workshop. Authors worked hard to include detailed explanations, broader and deeper analyses, and new exciting results, which were thoroughly reviewed by an internationally renowned committee. We hope that this distinctly joint effort will provide a meaningful and useful snapshot of the multilingual state of the art in multiword expressions modelling and processing, and will be a point of reference for future work.
Low-Resource Unsupervised NMT: Diagnosing the Problem and Providing a Linguistically Motivated Solution
Unsupervised Machine Translation has been advancing our ability to translate without parallel data, but state-of-the-art methods assume an abundance of monolingual data. This paper investigates the scenario where monolingual data is limited as well, finding that current unsupervised methods suffer in performance under this stricter setting. We find that the performance loss originates from the poor quality of the pretrained monolingual embeddings, and we propose using linguistic information in the embedding training scheme. To support this, we look at two linguistic features that may help improve alignment quality: dependency information and sub-word information. Using dependency-based embeddings results in a complementary word representation which offers a boost in performance of around 1.5 BLEU points compared to standard WORD2VEC when monolingual data is limited to 1 million sentences per language. We also find that the inclusion of sub-word information is crucial to improving the quality of the embedding
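To make "sub-word information" concrete, the sketch below shows one common way of exposing it to an embedding model: representing each word by its character n-grams, with boundary markers, in the style of fastText. The abstract does not say which sub-word scheme the paper uses, so the function name `char_ngrams`, the boundary symbols and the default n-gram range are all illustrative assumptions.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return the character n-grams of a word, fastText-style.

    Assumption: '<' and '>' mark the word boundaries, and n-grams of
    every length from n_min to n_max (inclusive) are collected.
    """
    token = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        # Slide a window of width n across the boundary-marked token.
        for start in range(len(token) - n + 1):
            grams.append(token[start:start + n])
    return grams


if __name__ == "__main__":
    # Trigrams only, to keep the output short.
    print(char_ngrams("where", n_min=3, n_max=3))
```

In a model of this kind, a word vector is typically built from the vectors of its n-grams, so rare or unseen words still receive a representation from the pieces they share with frequent words.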
Understanding and Enhancing the Use of Context for Machine Translation
To understand and infer meaning in language, neural models have to learn
complicated nuances. Discovering distinctive linguistic phenomena from data is
not an easy task. For instance, lexical ambiguity is a fundamental feature of
language which is challenging to learn. Even more prominently, inferring the
meaning of rare and unseen lexical units is difficult with neural networks.
Meaning is often determined from context. With context, languages allow meaning
to be conveyed even when the specific words used are not known by the reader.
To model this learning process, a system has to learn from a few instances in
context and be able to generalize well to unseen cases. The learning process is
hindered when training data is scarce for a task. Even with sufficient data,
learning patterns for the long tail of the lexical distribution is challenging.
In this thesis, we focus on understanding certain potentials of contexts in
neural models and design augmentation models to benefit from them. We focus on
machine translation as an important instance of the more general language
understanding problem. To translate from a source language to a target
language, a neural model has to understand the meaning of constituents in the
provided context and generate constituents with the same meanings in the target
language. This task accentuates the value of capturing nuances of language and
the necessity of generalization from few observations. The main problem we
study in this thesis is what neural machine translation models learn from data
and how we can devise more focused contexts to enhance this learning. Looking
more in-depth into the role of context and the impact of data on learning
models is essential to advance the NLP field. Moreover, it helps highlight the
vulnerabilities of current neural networks and provides insights into designing
more robust models.
Comment: PhD dissertation defended on November 10th, 202
Automatic identification and translation of multiword expressions
A thesis submitted in partial fulfilment of the requirements of the
University of Wolverhampton for the degree of Doctor of Philosophy.
Multiword Expressions (MWEs) belong to a class of phraseological phenomena
that is ubiquitous in the study of language. They are heterogeneous
lexical items consisting of more than one word and feature lexical, syntactic,
semantic and pragmatic idiosyncrasies. Scholarly research on MWEs benefits
both natural language processing (NLP) applications and end users.
This thesis involves designing new methodologies to identify and translate
MWEs. In order to deal with MWE identification, we first develop datasets
of annotated verb-noun MWEs in context. We then propose a method which
employs word embeddings to disambiguate between literal and idiomatic usages
of the verb-noun expressions. The existence of expression types with various
idiomatic and literal distributions leads us to re-examine their modelling and
evaluation. We propose a type-aware train and test splitting approach to
prevent models from overfitting and avoid misleading evaluation results.
Identification of MWEs in context can be modelled with sequence tagging
methodologies. To this end, we devise a new neural network architecture,
which is a combination of convolutional neural networks and long short-term
memory networks with an optional conditional random field layer on top. We
conduct extensive evaluations on several languages, demonstrating better
performance than state-of-the-art systems. Experiments show that the
generalisation power of the model in predicting unseen MWEs is significantly
better than that of previous systems.
In order to find translations for verb-noun MWEs, we propose a bilingual
distributional similarity approach derived from a word embedding model that
supports arbitrary contexts. The technique is devised to extract translation
equivalents from comparable corpora, which are an alternative resource to
costly parallel corpora. We finally conduct a series of experiments to investigate
the effects of size and quality of comparable corpora on automatic
extraction of translation equivalents.
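The type-aware train and test splitting mentioned in this abstract can be illustrated with a short sketch. The abstract does not spell out the procedure, so the code below follows one plausible reading: all instances of a given expression type (for example, every occurrence of "take place") are kept on the same side of the split, so a model cannot score well simply by memorising types it saw in training. The function name `type_aware_split`, the tuple layout and the parameters are hypothetical.

```python
import random
from collections import defaultdict


def type_aware_split(instances, test_fraction=0.2, seed=0):
    """Split instances so that no expression type occurs in both sets.

    Assumption: each instance is a tuple (expression_type, sentence, label).
    """
    # Group all instances by their expression type.
    by_type = defaultdict(list)
    for instance in instances:
        by_type[instance[0]].append(instance)

    # Shuffle the types (not the instances) reproducibly.
    types = sorted(by_type)
    random.Random(seed).shuffle(types)

    # Reserve a fraction of the types, at least one, for the test set.
    n_test = max(1, round(len(types) * test_fraction))
    test_types = set(types[:n_test])

    train = [i for t in types if t not in test_types for i in by_type[t]]
    test = [i for t in types if t in test_types for i in by_type[t]]
    return train, test


if __name__ == "__main__":
    data = [
        ("take place", "The meeting took place on Monday.", "idiomatic"),
        ("take place", "Please take your place in line.", "literal"),
        ("make sense", "That makes sense.", "idiomatic"),
        ("give way", "The bridge gave way.", "idiomatic"),
        ("have time", "I have time today.", "literal"),
    ]
    train, test = type_aware_split(data, test_fraction=0.25)
    print(len(train), "training instances,", len(test), "test instances")
```

Splitting at the level of types rather than instances usually gives a lower but more honest estimate of how well a model generalises to unseen expressions.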
Extended papers from the MWE 2017 workshop
The annual workshop on multiword expressions has taken place since 2001 in conjunction with major computational linguistics conferences and attracts the attention of an ever-growing community working on a variety of languages, linguistic phenomena and related computational processing issues. MWE 2017 took place in Valencia, Spain, and represented a vibrant panorama of the current research landscape on the computational treatment of multiword expressions, featuring many high-quality submissions. Furthermore, MWE 2017 included the first shared task on multilingual identification of verbal multiword expressions. The shared task, with extended communal work, has developed important multilingual resources and mobilised several research groups in computational linguistics worldwide.
This book contains extended versions of selected papers from the workshop. Authors worked hard to include detailed explanations, broader and deeper analyses, and new exciting results, which were thoroughly reviewed by an internationally renowned committee. We hope that this distinctly joint effort will provide a meaningful and useful snapshot of the multilingual state of the art in multiword expressions modelling and processing, and will be a point of reference for future work.
A Hybrid Machine Translation Framework for an Improved Translation Workflow
Over the past few decades, due to a continuing surge in the amount of content being translated and ever-increasing pressure to deliver high-quality and high-throughput translation, translation industries are focusing their interest on adopting advanced technologies such as machine translation (MT) and automatic post-editing (APE) in their translation workflows. Despite the progress of the technology, the roles of humans and machines essentially remain intact as MT/APE are moving from the peripheries of the translation field closer towards collaborative human-machine based MT/APE in modern translation workflows. Professional translators increasingly become post-editors correcting raw MT/APE output instead of translating from scratch, which in turn increases productivity in terms of translation speed. The last decade has seen substantial growth in research and development activities on improving MT, usually concentrating on selected aspects of workflows, from training data pre-processing techniques to core MT processes to post-editing methods. To date, however, complete MT workflows are less investigated than the core MT processes. In the research presented in this thesis, we investigate avenues towards achieving improved MT workflows. We study how different MT paradigms can be utilized and integrated to best effect. We also investigate how different upstream and downstream component technologies can be hybridized to achieve overall improved MT. Finally, we include an investigation into human-machine collaborative MT by bringing humans into the loop. In many (but not all) of the experiments presented in this thesis we focus on data scenarios provided by low-resource language settings.
Owing to the steadily growing volume of translation over recent decades and the simultaneously growing pressure to deliver high quality in the shortest possible time, translation service providers depend on integrating modern technologies such as machine translation (MT) and automatic post-editing (APE) into the translation workflow. Despite considerable progress in these technologies, the roles of humans and machines have hardly changed. MT/APE is, however, no longer merely a marginal phenomenon; it is increasingly used in modern translation workflows in collaboration between human and machine. Professional translators are increasingly becoming post-editors who correct MT/APE output instead of producing translations from scratch as before, which increases productivity in terms of translation speed. Over the last decade, a great deal has happened in research and development on improving MT, covering the complete translation workflow from the preparation of training data through the actual MT process to post-editing methods. From a data perspective, however, the complete translation workflow receives far less attention than the actual MT process. This dissertation investigates paths towards an ideal, or at least improved, MT workflow. The experiments pay particular attention to the specific needs of low-resource languages. The dissertation investigates how different MT paradigms can be used and optimally integrated. It also shows how different upstream and downstream technology components can be adapted to generate better MT output overall. Finally, it shows how humans can be integrated into the MT workflow. The aim of this work is to integrate various technology components into the MT workflow so as to create an improved overall workflow; hybridization approaches are mainly used for this purpose. The work also investigates ways of effectively involving humans as post-editors.
Bibliographie annuelle: recherche suisse sur le plurilinguisme 2014
The Annual Bibliography of Swiss Research on Multilingualism contains a selection of scholarly publications in linguistics, sociology, pedagogy and other fields related to multilingualism. This issue contains bibliographic information on publications from the year 2015. It includes journal articles, book chapters, monographs, edited volumes and online documents by researchers at Swiss institutions, as well as publications that international researchers have contributed to selected Swiss journals. The bibliography catalogues publications in Switzerland's national languages and in English.