Solving headswitching translation cases in LFG-DOT
It has been shown that LFG-MT (Kaplan et al., 1989) has difficulties with headswitching data (Sadler et al., 1989, 1990; Sadler & Thompson, 1991). We revisit these arguments in this paper. Although attempts have been made to solve these problematic constructions using approaches based on linear logic (Van Genabith et al., 1998) and restriction (Kaplan & Wedekind, 1993), we point out further problems that these approaches introduce.
We then show how LFG-DOP (Bod & Kaplan, 1998) can be extended to serve as a novel hybrid model for MT, LFG-DOT (Way, 1999, 2001), which promises to improve upon the DOT model of translation (Poutsma 1998, 2000) as well as LFG-MT. LFG-DOT improves the robustness of LFG-MT through the use of the LFG-DOP Discard operator, which produces generalized fragments by discarding certain f-structure features. LFG-DOT can, therefore, deal with ill-formed or previously unseen input where LFG-MT cannot. Finally, we demonstrate that LFG-DOT can cope with translational phenomena that prove problematic for other LFG-based models of translation.
Learning labelled dependencies in machine translation evaluation
Recently, novel MT evaluation metrics have been presented which go beyond pure string matching, and which correlate better than other existing metrics with human judgements. Other research in this area has presented machine learning methods which learn directly from human judgements. In this paper, we present a novel combination of dependency- and machine learning-based approaches to automatic MT evaluation, and demonstrate greater correlations with human judgement than the existing state-of-the-art methods.
In addition, we examine the extent to which our novel method can be generalised across different tasks and domains.
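As a rough illustration of what going beyond string matching can look like, the sketch below scores a candidate translation by its overlap in labelled dependency triples with a reference. It is not the method of the paper: the (label, head, dependent) triple format and the function name `dependency_f1` are assumptions made for this example, the triples are assumed to come from some external parser, and the machine learning component described in the abstract is not modelled at all.

```python
from collections import Counter

def dependency_f1(candidate, reference):
    """F1 over labelled dependency triples (label, head, dependent).

    Both arguments are lists of triples; duplicates are counted,
    so the overlap is a multiset intersection.
    """
    cand, ref = Counter(candidate), Counter(reference)
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# Hypothetical parser output for a reference and a candidate translation.
reference = [("subj", "resigned", "john"), ("adv", "resigned", "yesterday")]
candidate = [("subj", "resigned", "john"), ("obj", "resigned", "yesterday")]
score = dependency_f1(candidate, reference)
```

Here one of the two triples matches on label, head and dependent, so precision and recall are both one half. Dropping the label from each triple would give the unlabelled variant of the same score.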
Joining hands: developing a sign language machine translation system with and for the deaf community
This paper discusses the development of an automatic machine translation (MT) system for translating spoken language text into signed languages (SLs). The motivation for our work is the improvement of accessibility to airport information announcements for D/deaf and hard of hearing people. This paper demonstrates the involvement of Deaf colleagues and members of the D/deaf community in Ireland in three areas of our research: the choice of a domain for automatic translation that has a practical use for the D/deaf community; the human translation of English text into Irish Sign Language (ISL) as well as advice on ISL grammar and linguistics; and the importance of native ISL signers as manual evaluators of our translated output.
Lost in translation: the problems of using mainstream MT evaluation metrics for sign language translation
In this paper, we consider the problems of applying corpus-based techniques to minority languages that are neither politically recognised nor have a formally accepted writing system, namely sign languages. We discuss the adoption of an annotated form of sign language data as a suitable corpus for the development of a data-driven machine translation (MT) system, and deal with issues that arise from its use. Useful software tools that facilitate easy annotation of video data are also discussed. Furthermore, we address the problems of using traditional MT evaluation metrics for sign language translation. Based on the candidate translations produced by our example-based machine translation system, we discuss why standard metrics fall short of providing an accurate evaluation and suggest more suitable evaluation methods.
A three-pass system combination framework by combining multiple hypothesis alignment methods
So far, many effective hypothesis alignment metrics have been proposed and applied to system combination, such as TER, HMM, ITER and IHMM. In addition, Minimum Bayes-risk (MBR) decoding and the confusion network (CN) have become the state-of-the-art techniques in system combination. In this paper, we present a three-pass system combination strategy that can combine hypothesis alignment results derived from different alignment metrics to generate a better translation. First, the different alignment metrics are used to align the backbone and hypotheses, and an individual CN is built for each alignment result; then we construct a super network by merging the multiple metric-based CNs and generate a consensus output. Finally, a modified consensus network MBR (ConMBR) approach is employed to search for the best translation. Our proposed strategy outperforms the best single CN as well as the best single system in our experiments on the NIST Chinese-to-English test set.
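To make the MBR idea concrete, here is a minimal sketch of sentence-level MBR selection: among a set of system outputs, pick the one with the lowest expected loss against the others. It is not the ConMBR procedure of the paper. The loss is a plain word-level Levenshtein distance standing in for TER (no shift operation), the posteriors default to uniform, and the names `edit_distance` and `mbr_select` are assumptions made for this example.

```python
def edit_distance(a, b):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        cur = [i]
        for j, wb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete wa
                           cur[j - 1] + 1,             # insert wb
                           prev[j - 1] + (wa != wb)))  # match or substitute
        prev = cur
    return prev[-1]

def mbr_select(hypotheses, posteriors=None):
    """Return the hypothesis with minimum expected edit distance to the rest."""
    tokens = [h.split() for h in hypotheses]
    n = len(tokens)
    if posteriors is None:
        posteriors = [1.0 / n] * n  # assumption: all systems equally trusted

    def risk(i):
        return sum(p * edit_distance(tokens[i], tokens[j])
                   for j, p in enumerate(posteriors) if j != i)

    return hypotheses[min(range(n), key=risk)]

outputs = [
    "the cat sat on the mat",
    "the cat sits on the mat",
    "a cat sat on a mat",
]
best = mbr_select(outputs)
```

In this toy set the first output is one edit away from the second and two from the third, so it has the lowest expected loss and is selected. A typical use of such a selection step in system combination is choosing the backbone against which the other hypotheses are aligned.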
Data-oriented parsing and the Penn Chinese treebank
We present an investigation into parsing the Penn Chinese Treebank using a Data-Oriented Parsing (DOP) approach. DOP comprises an experience-based approach to natural language parsing. Most published research in the DOP framework uses PS-trees as its representation schema. Drawbacks of the DOP approach centre around issues of efficiency. We incorporate recent advances in DOP parsing techniques into a novel DOP parser which generates a compact representation of all subtrees which can be derived from any full parse tree.
We compare our work to previous work on parsing the Penn Chinese Treebank, and provide both a quantitative and qualitative evaluation. While our results in terms of Precision and Recall are slightly below those published in related research, our approach requires no manual encoding of head rules, nor is a development phase per se necessary.
We also note that certain constructions which were problematic in this previous work can be handled correctly by our DOP parser. Finally, we observe that the ‘DOP Hypothesis’ is confirmed for parsing the Penn Chinese Treebank.
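The efficiency concern mentioned above comes from the number of subtrees (fragments) a single parse tree yields. The sketch below counts them with the standard recurrence: the number of fragments rooted at a node is the product, over its children, of one plus the number of fragments rooted at that child, since each nonterminal child can either be left as a frontier node or expanded by any of its own fragments. The tuple encoding of trees and the function names are assumptions for this example; the sketch does not reproduce the compact representation used by the parser in the paper.

```python
from math import prod

def fragments_rooted_at(tree):
    """Count DOP fragments whose root is the root of `tree`.

    A tree is (label, child, ...) for a nonterminal and a plain string
    for a word; words cannot root a fragment, so they count as 0.
    """
    if isinstance(tree, str):
        return 0
    return prod(1 + fragments_rooted_at(child) for child in tree[1:])

def total_fragments(tree):
    """Count fragments rooted at every nonterminal node of `tree`."""
    if isinstance(tree, str):
        return 0
    return fragments_rooted_at(tree) + sum(total_fragments(c) for c in tree[1:])

toy_tree = ("S", ("NP", "John"), ("VP", ("V", "sleeps")))
rooted_at_s = fragments_rooted_at(toy_tree)
all_fragments = total_fragments(toy_tree)
```

Even this three-word toy tree yields ten fragments, six of them rooted at S, and the count grows multiplicatively with tree depth and branching. That growth is why DOP parsers avoid enumerating fragments explicitly.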
Hybrid example-based SMT: the best of both worlds?
Way and Gough (2005) provide an in-depth comparison of their Example-Based Machine Translation (EBMT) system with a Statistical Machine Translation (SMT) system constructed from freely available tools. According to a wide variety of automatic evaluation metrics, they demonstrated that their EBMT system outperformed the SMT system by a factor of two to one.
Nevertheless, they did not test their EBMT system against a phrase-based SMT system. Obtaining their training and test data for English–French, we carry out a number of experiments using the Pharaoh SMT Decoder. While better results are seen when Pharaoh is seeded with Giza++ word- and phrase-based data compared to EBMT sub-sentential alignments, in general better results are obtained when combinations of this 'hybrid' data are used to construct the translation and probability models. While for the most part the EBMT system of Gough & Way (2004b) outperforms any flavour of the phrase-based SMT systems constructed in our experiments, combining the data sets automatically induced by both Giza++ and their EBMT system leads to a hybrid system which improves on the EBMT system per se for French–English.
Bilingually motivated domain-adapted word segmentation for statistical machine translation
We introduce a word segmentation approach to languages where word boundaries are not orthographically marked, with application to Phrase-Based Statistical Machine Translation (PB-SMT). Instead of using manually segmented monolingual domain-specific corpora to train segmenters, we make use of bilingual corpora and statistical word alignment techniques. First of all, our approach is adapted for the specific translation task at hand by taking the corresponding source (target) language into account. Secondly, this approach does not rely on manually segmented training data so that it can be automatically adapted for different domains. We evaluate the performance of our segmentation approach on PB-SMT tasks from two domains and demonstrate that our approach scores consistently among the best results across different data conditions.
Controlled generation in example-based machine translation
The theme of controlled translation is currently in vogue in the area of MT. Recent research (Schäler et al., 2003; Carl, 2003) hypothesises that EBMT systems are perhaps best suited to this challenging task. In this paper, we present an EBMT system where the generation of the target string is filtered by data written according to controlled language specifications. As far as we are aware, this is the only research available on this topic. In the field of controlled language applications, it is more usual to constrain the source language in this way rather than the target. We translate a small corpus of controlled English into French using the on-line MT system Logomedia, and seed the memories of our EBMT system with a set of automatically induced lexical resources using the Marker Hypothesis as a segmentation tool. We test our system on a large set of sentences extracted from a Sun Translation Memory, and provide both an automatic and a human evaluation. For comparative purposes, we also provide results for Logomedia itself.
Using TERp to augment the system combination for SMT
TER-Plus (TERp) is an extended TER evaluation metric incorporating morphology, synonymy and paraphrases. There are three new edit operations in TERp: Stem Matches, Synonym Matches and Phrase Substitutions (Paraphrases). In this paper, we propose a TERp-based augmented system combination in terms of the backbone selection and consensus decoding network. Combining the new properties of TERp, we also propose a two-pass decoding strategy for the lattice-based phrase-level confusion network (CN) to generate the final result. The experiments conducted on the NIST 2008 Chinese-to-English test set show that our TERp-based augmented system combination framework achieves significant improvements in terms of BLEU and TERp scores compared to the state-of-the-art word-level system combination framework and a TER-based combination strategy.
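The effect of stem and synonym matches can be illustrated with a relaxed word-level edit rate in which near-matches cost less than a full substitution. This is only a sketch in the spirit of TERp, not the TERp tool or the combination framework of the paper: it has no shift or phrase-substitution operation, the costs of 0.5 are arbitrary placeholders, and `crude_stem`, `sub_cost`, `relaxed_edit_rate` and the synonym table are hypothetical names and data invented for this example.

```python
def crude_stem(word):
    """Toy suffix stripper, standing in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def sub_cost(a, b, synonyms, stem_cost=0.5, syn_cost=0.5):
    """Substitution cost: free for exact match, reduced for near-matches."""
    if a == b:
        return 0.0
    if crude_stem(a) == crude_stem(b):
        return stem_cost
    if frozenset((a, b)) in synonyms:
        return syn_cost
    return 1.0

def relaxed_edit_rate(hyp, ref, synonyms=frozenset()):
    """Weighted word-level edit distance divided by reference length."""
    h, r = hyp.split(), ref.split()
    prev = [float(j) for j in range(len(r) + 1)]
    for i, hw in enumerate(h, 1):
        cur = [float(i)]
        for j, rw in enumerate(r, 1):
            cur.append(min(prev[j] + 1.0,      # delete hw
                           cur[j - 1] + 1.0,   # insert rw
                           prev[j - 1] + sub_cost(hw, rw, synonyms)))
        prev = cur
    return prev[-1] / max(len(r), 1)

# Hypothetical synonym table; a real system would draw on a lexical resource.
toy_synonyms = frozenset({frozenset(("loudly", "noisily"))})
strict = relaxed_edit_rate("the dogs barked loudly", "the dog barked noisily")
relaxed = relaxed_edit_rate("the dogs barked loudly", "the dog barked noisily",
                            toy_synonyms)
```

With the synonym table supplied, the hypothesis is penalised less than without it, because "loudly" and "noisily" are then treated as a near-match rather than a full substitution. Relaxed matching of this kind can also change which words are aligned to each other when hypotheses are merged.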