Search CORE

Syntactic phrase-based statistical machine translation

Author: Hassan Hany
Hearne Mary
Sima'an Khalil
Way Andy
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2006

Phrase-based statistical machine translation (PBSMT) systems represent the dominant approach in MT today. However, unlike systems in other paradigms, it has proven difficult to date to incorporate syntactic knowledge in order to improve translation quality. This paper improves on recent research which uses 'syntactified' target language phrases, by incorporating supertags as constraints to better resolve parse tree fragments. In addition, we do not impose any sentence-length limit, and using a log-linear decoder, we outperform a state-of-the-art PBSMT system by over 1.3 BLEU points (or 3.51% relative) on the NIST 2003 Arabic-English test corpus

Crossref

Irish Universities

DCU Online Research Access Service

International Migration, Integration and Social Cohesion online publications

Disambiguation strategies for data-oriented translation

Author: Hearne Mary
Way Andy
Publication venue
Publication date: 01/01/2006

The Data-Oriented Translation (DOT) model, originally proposed in (Poutsma, 1998, 2003) and based on Data-Oriented Parsing (DOP) (e.g. (Bod, Scha, & Sima'an, 2003)), is best described as a hybrid model of translation as it combines examples, linguistic information and a statistical translation model. Although theoretically interesting, it inherits the computational complexity associated with DOP. In this paper, we focus on one computational challenge for this model: efficiently selecting the 'best' translation to output. We present four different disambiguation strategies in terms of how they are implemented in our DOT system, along with experiments which investigate how they compare in terms of accuracy and efficiency

CiteSeerX

Irish Universities

DCU Online Research Access Service

Robust language pair-independent sub-tree alignment

Author: Hearne Mary
Tinsley John
Way Andy
Zhechev Ventsislav
Publication venue: European Association for Machine Translation
Publication date: 01/01/2007

Data-driven approaches to machine translation (MT) achieve state-of-the-art results. Many syntax-aware approaches, such as Example-Based MT and Data-Oriented Translation, make use of tree pairs aligned at sub-sentential level. Obtaining sub-sentential alignments manually is time-consuming and error-prone, and requires expert knowledge of both source and target languages. We propose a novel, language pair-independent algorithm which automatically induces alignments between phrase-structure trees. We evaluate the alignments themselves against a manually aligned gold standard, and perform an extrinsic evaluation by using the aligned data to train and test a DOT system. Our results show that translation accuracy is comparable to that of the same translation system trained on manually aligned data, and coverage improves

Irish Universities

DCU Online Research Access Service

Automatic analysis of semantic similarity in comparable text through syntactic tree matching

Author: Krahmer E.J.
Marsi E.C.
Publication venue: Chinese Information Processing Society of China (CIPS)
Publication date: 01/01/2010

Tilburg University Repository

English-Hungarian dictionary-based noun phrase alignment and translation-based noun phrase identification

Author: Pohl Gábor
Publication venue
Publication date: 01/01/2005

A prerequisite of example-based machine translation (EBMT) is the ability to automatically align the sub-sentential structural units of source-language example sentences with those of the corresponding target-language sentences. In this paper we present a noun phrase alignment algorithm developed for an EBMT-based English-Hungarian translation memory (MetaMorpho TM), together with a method for identifying Hungarian noun phrases on the basis of their English equivalents. When aligning noun phrases, our method computes a heuristic similarity score for every possible noun phrase pair, using stemmed dictionary lookup and searching for similarly spelled words (cognates) and part-of-speech matches, and then decides on the alignment of the individual noun phrases on the basis of this score. Our method for finding the Hungarian equivalents of English noun phrases identified by a syntactic parser requires no Hungarian syntactic parser: it maps the words of the English noun phrases onto the words of the Hungarian sentence with the help of a dictionary, then takes, among the possible coverings, the one that spans the shortest stretch of the Hungarian sentence and expands it into a complete Hungarian noun phrase (also taking into account, during the expansion, the part of speech of the words not matched by the dictionary). Finally, we report our first alignment results
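
The noun phrase alignment step described in this abstract can be illustrated with a minimal Python sketch. It is not the MetaMorpho TM implementation: the toy dictionary, the crude suffix stripper, the score weights (1.0, 0.7, 0.2), the cognate threshold and the greedy one-to-one linking are all assumptions made for illustration. It covers only the pairwise scoring and alignment, not the dictionary-based expansion of Hungarian noun phrases.

```python
from difflib import SequenceMatcher

# Hypothetical toy resources: a real system would use a full bilingual
# dictionary and a proper Hungarian stemmer.
TOY_DICT = {"red": {"piros"}, "car": {"autó"}, "house": {"ház"}}
SUFFIXES = ("ban", "ben", "nak", "nek", "ot", "at", "et", "t")


def stem(word):
    """Crude suffix stripping, standing in for real stemming."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def is_cognate(a, b, threshold=0.8):
    """Treat two words as cognates if their spelling is similar enough."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def similarity(en_np, hu_np):
    """Score an NP pair; each NP is a list of (word, pos) tuples."""
    score = 0.0
    for en_word, en_pos in en_np:
        best = 0.0
        for hu_word, hu_pos in hu_np:
            if stem(hu_word.lower()) in TOY_DICT.get(en_word.lower(), set()):
                best = max(best, 1.0)  # stemmed dictionary match
            elif is_cognate(en_word, hu_word):
                best = max(best, 0.7)  # similarly spelled word
            elif en_pos == hu_pos:
                best = max(best, 0.2)  # part-of-speech agreement only
        score += best
    # Normalise by the longer phrase so unmatched words lower the score.
    return score / max(len(en_np), len(hu_np))


def align(en_nps, hu_nps, min_score=0.3):
    """Greedily link the highest-scoring pairs, at most one link per NP."""
    scored = sorted(
        ((similarity(e, h), i, j)
         for i, e in enumerate(en_nps)
         for j, h in enumerate(hu_nps)),
        reverse=True,
    )
    used_en, used_hu, links = set(), set(), []
    for score, i, j in scored:
        if score >= min_score and i not in used_en and j not in used_hu:
            links.append((i, j))
            used_en.add(i)
            used_hu.add(j)
    return sorted(links)
```

Calling `align(en_nps, hu_nps)` on two lists of tagged noun phrases returns index pairs `(i, j)` linking English phrase `i` to Hungarian phrase `j`. Pairs scoring below `min_score` stay unaligned.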

University of Szeged

Automatic generation of parallel treebanks: an efficient unsupervised system

Author: Zhechev Ventsislav
Publication venue: Dublin City University. School of Computing
Publication date: 01/01/2009

The need for syntactically annotated data for use in natural language processing has increased dramatically in recent years. This is true especially for parallel treebanks, of which very few exist. The ones that exist are mainly hand-crafted and too small for reliable use in data-oriented applications. In this work I introduce a novel open-source platform for the fast and robust automatic generation of parallel treebanks through sub-tree alignment, using a limited amount of external resources. The intrinsic and extrinsic evaluations that I undertook demonstrate that my system is a feasible alternative to the manual annotation of parallel treebanks. Therefore, I expect the presented platform to help boost research in the field of syntax-augmented machine translation and lead to advancements in other fields where parallel treebanks can be employed

CiteSeerX

Irish Universities

DCU Online Research Access Service

Hybrid data-driven models of machine translation

Author: Groves Declan
Publication venue: Dublin City University. School of Computing
Publication date: 01/01/2007

Corpus-based approaches to Machine Translation (MT) dominate the MT research field today, with Example-Based MT (EBMT) and Statistical MT (SMT) representing two different frameworks within the data-driven paradigm. EBMT has always made use of both phrasal and lexical correspondences to produce high-quality translations. Early SMT models, on the other hand, were based on word-level correspondences, but with the advent of more sophisticated phrase-based approaches, the line between EBMT and SMT has become increasingly blurred. In this thesis we carry out a number of translation experiments comparing the performance of the state-of-the-art marker-based EBMT system of Gough and Way (2004a, 2004b), Way and Gough (2005) and Gough (2005) against a phrase-based SMT (PBSMT) system built using the state-of-the-art PHARAOH phrase-based decoder (Koehn, 2004a) and employing standard phrasal extraction heuristics (Koehn et al., 2003). In addition, we describe experiments investigating the possibility of combining elements of EBMT and SMT in order to create a hybrid data-driven model of MT capable of outperforming either approach from which it is derived. Making use of training and testing data taken from a French-English translation memory of Sun Microsystems computer documentation, we find that while better results are seen when the PBSMT system is seeded with GIZA++ word- and phrase-based data compared to EBMT marker-based sub-sentential alignments, in general improvements are obtained when combinations of this 'hybrid' data are used to construct the translation and probability models. While for the most part the baseline marker-based EBMT system outperforms any flavour of the PBSMT systems constructed in these experiments, combining the data sets automatically induced by both GIZA++ and the EBMT system leads to a hybrid system which improves on the EBMT system per se for French-English.
On a different data set, taken from the Europarl corpus (Koehn, 2005), we perform a number of experiments making use of incremental training data sizes of 78K, 156K and 322K sentence pairs. On this data set, we show that similar gains are to be had from constructing a hybrid 'statistical EBMT' system capable of outperforming the baseline EBMT system. This time around, although all 'hybrid' variants of the EBMT system fall short of the quality achieved by the baseline PBSMT system, merging elements of the marker-based and SMT data, as in the Sun Microsystems experiments, to create a hybrid 'example-based SMT' system, outperforms the baseline SMT and EBMT systems from which it is derived. Furthermore, we provide further evidence in favour of hybrid data-driven approaches by adding an SMT target language model to all EBMT system variants and demonstrate that this too has a positive effect on translation quality. Following on from these findings we present a new hybrid data-driven MT architecture, together with a novel marker-based decoder which improves upon the performance of the marker-based EBMT system of Gough and Way (2004a, 2004b), Way and Gough (2005) and Gough (2005), and compares favourably with the state-of-the-art PHARAOH SMT decoder (Koehn, 2004a)

CiteSeerX

Irish Universities

DCU Online Research Access Service

Resourcing machine translation with parallel treebanks

Author: Tinsley John
Publication venue: Dublin City University. School of Computing
Publication date: 01/03/2010

The benefits of syntax-based approaches to data-driven machine translation (MT) are clear: given the right model, a combination of hierarchical structure, constituent labels and morphological information can be exploited to produce more fluent, grammatical translation output. This has been demonstrated by the recent shift in research focus towards such linguistically motivated approaches. However, one issue facing developers of such models that is not encountered in the development of state-of-the-art string-based statistical MT (SMT) systems is the lack of available syntactically annotated training data for many languages. In this thesis, we propose a solution to the problem of limited resources for syntax-based MT by introducing a novel sub-sentential alignment algorithm for the induction of translational equivalence links between pairs of phrase structure trees. This algorithm, which operates on a language pair-independent basis, allows for the automatic generation of large-scale parallel treebanks which are useful not only for machine translation, but also across a variety of natural language processing tasks. We demonstrate the viability of our automatically generated parallel treebanks by means of a thorough evaluation process during which they are compared to a manually annotated gold standard parallel treebank both intrinsically and in an MT task. Following this, we hypothesise that these parallel treebanks are not only useful in syntax-based MT, but also have the potential to be exploited in other paradigms of MT. To this end, we carry out a large number of experiments across a variety of data sets and language pairs, in which we exploit the information encoded within the parallel treebanks in various components of phrase-based statistical MT systems. 
We demonstrate that improvements in translation accuracy can be achieved by enhancing SMT phrase tables with linguistically motivated phrase pairs extracted from a parallel treebank, while showing that a number of other features in SMT can also be supplemented with varying degrees of effectiveness. Finally, we examine ways in which synchronous grammars extracted from parallel treebanks can improve the quality of translation output, focussing on real translation examples from a syntax-based MT system

Irish Universities

DCU Online Research Access Service