14 research outputs found

Automatic Discovery of Non-Compositional Compounds in Parallel Data

Author: Melamed I. Dan
Publication date: 01/01/1997

Automatic segmentation of text into minimal content-bearing units is an unsolved problem even for languages like English. Spaces between words offer an easy first approximation, but this approximation is not good enough for machine translation (MT), where many word sequences are not translated word-for-word. This paper presents an efficient automatic method for discovering sequences of words that are translated as a unit. The method proceeds by comparing pairs of statistical translation models induced from parallel texts in two languages. It can discover hundreds of non-compositional compounds on each iteration, and constructs longer compounds out of shorter ones. Objective evaluation on a simple machine translation task has shown the method's potential to improve the quality of MT output. The method makes few assumptions about the data, so it can be applied to parallel data other than parallel texts, such as word spellings and pronunciations.

Comment: 12 pages; uses natbib.sty, here.st

arXiv.org e-Print Archive

CiteSeerX

Bootstrapping word alignment via word packing

Authors: Ma Yanjun; Stroppa Nicolas; Way Andy
Publication venue: 'Association for Computational Linguistics (ACL)'
Publication date: 01/01/2007

We introduce a simple method to pack words for statistical word alignment. Our goal is to simplify the task of automatic word alignment by packing several consecutive words together when we believe they correspond to a single word in the opposite language. This is done using the word aligner itself, i.e. by bootstrapping on its output. We evaluate the performance of our approach on a Chinese-to-English machine translation task, and report a 12.2% relative increase in BLEU score over a state-of-the-art phrase-based SMT system.
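
The packing idea in this abstract can be illustrated with a minimal sketch. It assumes the aligner's output for one sentence pair is available as (source index, target index) links, and it packs any run of consecutive target words that all link to the same source word. The function name `pack_target_words`, the underscore joiner, and the example sentence are hypothetical; the abstract does not state how candidates are selected, so this is a single-sentence simplification rather than the authors' procedure.

```python
from collections import defaultdict


def pack_target_words(target_tokens, links, joiner="_"):
    """Pack consecutive target words that share one aligned source word.

    links: iterable of (source_index, target_index) pairs, assumed to be
    the output of some word aligner for a single sentence pair.
    """
    # Collect the target positions linked to each source position.
    by_source = defaultdict(set)
    for src, tgt in links:
        by_source[src].add(tgt)

    # A span is packable when one source word links to two or more
    # target words that form a contiguous run.
    span_end = {}
    for tgt_idxs in by_source.values():
        lo, hi = min(tgt_idxs), max(tgt_idxs)
        if len(tgt_idxs) >= 2 and hi - lo + 1 == len(tgt_idxs):
            span_end[lo] = hi

    # Rebuild the target sentence left to right, joining packable spans.
    packed, i = [], 0
    while i < len(target_tokens):
        end = span_end.get(i, i)
        packed.append(joiner.join(target_tokens[i:end + 1]))
        i = end + 1
    return packed


# Hypothetical example: the third source word links to "ice" and "cream".
example = pack_target_words(
    ["I", "like", "ice", "cream"],
    [(0, 0), (1, 1), (2, 2), (2, 3)],
)
```

In a bootstrapping setup, the packed text would presumably be fed back to the aligner for another pass; that loop is omitted here.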

Irish Universities

DCU Online Research Access Service

Multilingual domain modeling in Twenty-One: automatic creation of a bi-directional translation lexicon from a parallel corpus

Author: Hiemstra Djoerd
Publication venue: Rodopi
Publication date: 01/01/1998

Within the project Twenty-One, which aims at the effective dissemination of information on ecology and sustainable development, a system is developed that supports cross-language information retrieval in any of the four languages Dutch, English, French and German. Knowledge of this application domain is needed to enhance existing translation resources for the purpose of lexical disambiguation. This paper describes an algorithm for the automated acquisition of a translation lexicon from a parallel corpus. New about the presented algorithm is the statistical language model used. Because the algorithm is based on a symmetric translation model it becomes possible to identify one-to-many and many-to-one relations between words of a language pair. We claim that the presented method has two advantages over algorithms that have been published before. Firstly, because the translation model is more powerful, the resulting bilingual lexicon will be more accurate. Secondly, the resulting bilingual lexicon can be used to translate in both directions between a language pair. Different versions of the algorithm were evaluated on the Dutch and English versions of the Agenda 21 corpus, which is a UN document on the application domain of sustainable development.

CiteSeerX

Radboud Repository

University of Twente Research Information

Mining Parallel Text from the Web based on Sentence Alignment

Authors: Li Bo; Liu Juan; Zhu Huili
Publication venue: The Korean Society for Language and Information (KSLI)
Publication date: 01/01/2007

PACLIC 21 / Seoul National University, Seoul, Korea / November 1-3, 200

Waseda University Repository

Evaluating Machine Translation Performance on Chinese Idioms with a Blacklist Method

Authors: Fancellu Federico; Sennrich Rico; Shao Yutong; Webber Bonnie L.
Publication date: 20/02/2018

Idiom translation is a challenging problem in machine translation because the meaning of idioms is non-compositional, and a literal (word-by-word) translation is likely to be wrong. In this paper, we focus on evaluating the quality of idiom translation of MT systems. We introduce a new evaluation method based on an idiom-specific blacklist of literal translations, based on the insight that the occurrence of any blacklisted words in the translation output indicates a likely translation error. We introduce a dataset, CIBB (Chinese Idioms Blacklists Bank), and perform an evaluation of a state-of-the-art Chinese-English neural MT system. Our evaluation confirms that a sizable number of idioms in our test set are mistranslated (46.1%), that literal translation error is a common error type, and that our blacklist method is effective at identifying literal translation errors.

Comment: Full paper accepted by LREC, 8 page
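
The blacklist check described in this abstract can be sketched in a few lines. The sketch below assumes each test sentence comes with its own set of blacklisted English words and flags an output as a likely literal translation error if any of them appears. The function names, the simple regex tokenizer, and the sample idiom data are illustrative assumptions and are not taken from CIBB.

```python
import re


def blacklisted_hits(translation, blacklist):
    """Return the blacklisted words found in an MT output, in order of first appearance."""
    # Assumption: lowercase alphabetic tokens are enough for this illustration.
    tokens = re.findall(r"[a-z']+", translation.lower())
    hits = []
    for tok in tokens:
        if tok in blacklist and tok not in hits:
            hits.append(tok)
    return hits


def literal_error_rate(outputs, blacklists):
    """Fraction of outputs that contain at least one blacklisted word."""
    if not outputs:
        return 0.0
    flagged = sum(
        1 for out, bl in zip(outputs, blacklists) if blacklisted_hits(out, bl)
    )
    return flagged / len(outputs)


# Hypothetical data: two MT outputs for sentences containing an idiom whose
# literal reading involves a snake and feet.
outputs = [
    "He drew a snake and added feet to it.",
    "That last remark was superfluous.",
]
blacklists = [{"snake", "feet"}, {"snake", "feet"}]

hits = blacklisted_hits(outputs[0], blacklists[0])
rate = literal_error_rate(outputs, blacklists)
```

A check like this only detects literal renderings; an output with no blacklisted word can still be wrong in other ways.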

arXiv.org e-Print Archive

Edinburgh Research Explorer

Automatic extraction of Arabic multiword expressions

Authors: Attia Mohammed; Pecina Pavel; Toral Antonio; Tounsi Lamia; van Genabith Josef
Publication date: 01/01/2010

In this paper we investigate the automatic acquisition of Arabic Multiword Expressions (MWE). We propose three complementary approaches to extract MWEs from available data resources. The first approach relies on the correspondence asymmetries between Arabic Wikipedia titles and titles in 21 different languages. The second approach collects English MWEs from Princeton WordNet 3.0, translates the collection into Arabic using Google Translate, and utilizes different search engines to validate the output. The third uses lexical association measures to extract MWEs from a large unannotated corpus. We experimentally explore the feasibility of each approach and measure the quality and coverage of the output against gold standards.
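
The third approach mentions lexical association measures without naming one in this abstract. As an illustration only, the sketch below scores adjacent word pairs with pointwise mutual information (PMI), one commonly used association measure; the choice of PMI, the `min_count` threshold, the function name, and the toy English token list are all assumptions.

```python
import math
from collections import Counter


def pmi_bigrams(tokens, min_count=2):
    """Rank adjacent word pairs by pointwise mutual information (base 2)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)

    scores = {}
    for (w1, w2), count in bigrams.items():
        # Rare pairs get inflated PMI values, so skip them.
        if count < min_count:
            continue
        p_xy = count / n_bi
        p_x = unigrams[w1] / n_uni
        p_y = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))

    # Highest-scoring candidates first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy corpus, for illustration only.
toy_tokens = "new york is big new york is old it is so is that".split()
ranked = pmi_bigrams(toy_tokens)
```

On real data the top of such a ranking would still need filtering and validation against a gold standard, as the abstract notes.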

CiteSeerX

Irish Universities

DCU Online Research Access Service

Automatic Acquisition of Knowledge About Multiword Predicates

Authors: Fazly Afsaneh; Stevenson Suzanne
Publication venue: Institute of Linguistics, Academia Sinica
Publication date: 01/01/2005

PACLIC 19 / Taipei, Taiwan / December 1-3, 200

Waseda University Repository

A Statistical Approach to the Semantics of Verb-Particles

Authors: Baldwin Timothy; Bannard Colin; Lascarides Alex
Publication venue: 'Association for Computational Linguistics (ACL)'
Publication date: 01/01/2003

This paper describes a distributional approach to the semantics of verb-particle constructions (e.g. put up, make off). We report first on a framework for implementing and evaluating such models. We then go on to report on the implementation of some techniques for using statistical models acquired from corpus data to infer the meaning of verb-particle constructions.

University of Liverpool Repository

CiteSeerX

Crossref

Edinburgh Research Explorer