1,313 research outputs found
Translation Alignment Applied to Historical Languages: methods, evaluation, applications, and visualization
Translation alignment is an essential task in Digital Humanities and Natural
Language Processing, and it aims to link words/phrases in the source
text with their translation equivalents in the translation. In addition to
its importance in teaching and learning historical languages, translation
alignment builds bridges between ancient and modern languages through
which various linguistics annotations can be transferred. This thesis focuses
on word-level translation alignment applied to historical languages in general
and Ancient Greek and Latin in particular. As the title indicates, the thesis
addresses four interdisciplinary aspects of translation alignment.
The starting point was developing Ugarit, an interactive annotation tool
to perform manual alignment aiming to gather training data to train an
automatic alignment model. This effort resulted in more than 190k accurate
translation pairs that I used for supervised training later. Ugarit has been
used by many researchers and scholars also in the classroom at several
institutions for teaching and learning ancient languages, which resulted
in a large, diverse crowd-sourced aligned parallel corpus allowing us to
conduct experiments and qualitative analysis to detect recurring patterns in
annotators’ alignment practice and the generated translation pairs.
Further, I employed the recent advances in NLP and language modeling to
develop an automatic alignment model for historical low-resourced languages,
experimenting with various training objectives and proposing a training
strategy for historical languages that combines supervised and unsupervised
training with mono- and multilingual texts. Then, I integrated this alignment
model into other development workflows to project cross-lingual annotations
and induce bilingual dictionaries from parallel corpora.
Evaluation is essential to assess the quality of any model. To ensure employing the best practice, I reviewed the current evaluation procedure, defined
its limitations, and proposed two new evaluation metrics. Moreover, I introduced a visual analytics framework to explore and inspect alignment gold
standard datasets and support quantitative and qualitative evaluation of
translation alignment models. Besides, I designed and implemented visual
analytics tools and reading environments for parallel texts and proposed
various visualization approaches to support different alignment-related tasks
employing the latest advances in information visualization and best practice.
Overall, this thesis presents a comprehensive study that includes manual and
automatic alignment techniques, evaluation methods and visual analytics
tools that aim to advance the field of translation alignment for historical
languages
many faces, many places (Term21)
UIDB/03213/2020
UIDP/03213/2020publishersversionpublishe
many faces, many places (Term21)
UIDB/03213/2020
UIDP/03213/2020Proceedings of the LREC 2022 Workshop Language Resources and Evaluation Conferencepublishersversionpublishe
Language technologies for a multilingual Europe
This volume of the series “Translation and Multilingual Natural Language Processing” includes most of the papers presented at the Workshop “Language Technology for a Multilingual Europe”, held at the University of Hamburg on September 27, 2011 in the framework of the conference GSCL 2011 with the topic “Multilingual Resources and Multilingual Applications”, along with several additional contributions. In addition to an overview article on Machine Translation and two contributions on the European initiatives META-NET and Multilingual Web, the volume includes six full research articles. Our intention with this workshop was to bring together various groups concerned with the umbrella topics of multilingualism and language technology, especially multilingual technologies. This encompassed, on the one hand, representatives from research and development in the field of language technologies, and, on the other hand, users from diverse areas such as, among others, industry, administration and funding agencies. The Workshop “Language Technology for a Multilingual Europe” was co-organised by the two GSCL working groups “Text Technology” and “Machine Translation” (http://gscl.info) as well as by META-NET (http://www.meta-net.eu)
Language-Independent Methods for Identifying Cross-Lingual Similarity in Wikipedia
The diversity and richness of multilingual information available in Wikipedia have increased its significance as a language resource. The information extracted from Wikipedia has been utilised for many tasks, such as Statistical Machine Translation (SMT) and supporting multilingual information access. These tasks often rely on gathering data from articles that describe the same topic in different languages with the assumption that the contents are equivalent to each other. However, studies have shown that this might not be the case.
Given the scale and use of Wikipedia, there is a need to develop an approach to measure cross-lingual similarity across Wikipedia. Many existing similarity measures, however, require the availability of "language-dependent" resources, such as dictionaries or Machine Translation (MT) systems, to translate documents into the same language prior to comparison. This presents some challenges for some language pairs, particularly those involving "under-resourced" languages where the required linguistic resources are not widely available. This study aims to present a solution to this problem by first, investigating cross-lingual similarity in Wikipedia, and secondly, developing "language-independent" approaches to measure cross-lingual similarity in Wikipedia.
Two main contributions were provided in this work to identify cross-lingual similarity in Wikipedia. The first key contribution of this work is the development of a Wikipedia similarity corpus to understand the similarity characteristics of Wikipedia articles and to evaluate and compare various approaches for measuring cross-lingual similarity. The author elicited manual judgments from people with the appropriate language skills to assess similarities between a set of 800 pairs of interlanguage-linked articles. This corpus contains Wikipedia articles for eight language pairs (all pairs involving English and including well-resourced and under-resourced languages) of varying degrees of similarity.
The second contribution of this work is the development of language-independent approaches to measure cross-lingual similarity in Wikipedia. The author investigated the utility of a number of "lightweight" language-independent features in four different experiments. The first experiment investigated the use of Wikipedia links to identify and align similar sentences, prior to aggregating the scores of the aligned sentences to represent the similarity of the document pair. The second experiment investigated the usefulness of content similarity features (such as char-n-gram overlap, links overlap, word overlap and word length ratio). The third experiment focused on analysing the use of structure similarity features (such as the ratio of section length, and similarity between the section headings). And finally, the fourth experiment investigates a combination of these features in a classification and a regression approach.
Most of these features are language-independent whilst others utilised freely available resources (Wikipedia and Wiktionary) to assist in identifying overlapping information across languages. The approaches proposed are lightweight and can be applied to any languages written in Latin script; non-Latin script languages need to be transliterated prior to using these approaches. The performances of these approaches were evaluated against the human judgments in the similarity corpus.
Overall, the proposed language-independent approaches achieved promising results. The best performance is achieved with the combination of all features in a classification and a regression approach. The results show that the Random Forest classifier was able to classify 81.38% document pairs correctly (F1 score=0.79) in a binary classification problem, 50.88% document pairs correctly (F1 score=0.71) in a 5-class classification problem, and RMSE of 0.73 in a regression approach. These results are significantly higher compared to a classifier utilising machine translation and cosine similarity of the tf-idf scores. These findings showed that language-independent approaches can be used to measure cross-lingual similarity between Wikipedia articles. Future work is needed to evaluate these approaches in more languages and to incorporate more features
Resourcing machine translation with parallel treebanks
The benefits of syntax-based approaches to data-driven machine translation (MT) are clear: given the right model, a combination of hierarchical structure, constituent labels and morphological information can be exploited to produce more fluent, grammatical translation output. This has been demonstrated by the recent shift in research focus towards such linguistically motivated approaches. However, one issue facing developers of such models that is not encountered in the development of state-of-the-art string-based statistical MT (SMT) systems is the lack of available syntactically annotated training data for many languages.
In this thesis, we propose a solution to the problem of limited resources for syntax-based MT by introducing a novel sub-sentential alignment algorithm for the induction of translational equivalence links between pairs of phrase structure trees. This algorithm, which operates on a language pair-independent basis, allows for the automatic generation of large-scale parallel treebanks which are useful not only for machine translation, but also across a variety of natural language processing tasks. We demonstrate the viability of our automatically generated parallel treebanks by means of a thorough evaluation process during which they are compared to a manually annotated gold standard parallel treebank both intrinsically and in an MT task.
Following this, we hypothesise that these parallel treebanks are not only useful in syntax-based MT, but also have the potential to be exploited in other paradigms of MT. To this end, we carry out a large number of experiments across a variety of data sets and language pairs, in which we exploit the information encoded within the parallel treebanks in various components of phrase-based statistical MT systems. We demonstrate that improvements in translation accuracy can be achieved by enhancing SMT phrase tables with linguistically motivated phrase pairs extracted from a parallel treebank, while showing that a number of other features in SMT can also be supplemented with varying degrees of effectiveness. Finally, we examine ways in which synchronous grammars extracted from parallel treebanks can improve the quality of translation output, focussing on real translation examples from a syntax-based MT system
Representation and parsing of multiword expressions
This book consists of contributions related to the definition, representation and parsing of MWEs. These reflect current trends in the representation and processing of MWEs. They cover various categories of MWEs such as verbal, adverbial and nominal MWEs, various linguistic frameworks (e.g. tree-based and unification-based grammars), various languages including English, French, Modern Greek, Hebrew, Norwegian), and various applications (namely MWE detection, parsing, automatic translation) using both symbolic and statistical approaches
Current trends
Deep parsing is the fundamental process aiming at the representation of the syntactic
structure of phrases and sentences. In the traditional methodology this process is
based on lexicons and grammars representing roughly properties of words and interactions
of words and structures in sentences. Several linguistic frameworks, such as Headdriven
Phrase Structure Grammar (HPSG), Lexical Functional Grammar (LFG), Tree Adjoining
Grammar (TAG), Combinatory Categorial Grammar (CCG), etc., offer different
structures and combining operations for building grammar rules. These already contain
mechanisms for expressing properties of Multiword Expressions (MWE), which, however,
need improvement in how they account for idiosyncrasies of MWEs on the one
hand and their similarities to regular structures on the other hand. This collaborative
book constitutes a survey on various attempts at representing and parsing MWEs in the
context of linguistic theories and applications
- …