1,943 research outputs found
Unsupervised generation of parallel treebanks through sub-tree alignment
The need for syntactically annotated data for use in natural language processing has increased dramatically
in recent years. This is true especially for parallel treebanks, of which very few exist. The ones
that exist are mainly hand-crafted and too small for reliable use in data-oriented applications. In this
paper we introduce an open-source system for fast and robust automatic generation of parallel treebanks.
We expect the opening of the presented platform to the scientific community to help boost research
in the field of data-oriented machine translation and lead to advancements in other fields where
parallel treebanks can be employed
Irish treebanking and parsing: a preliminary evaluation
Language resources are essential for linguistic research and the development of NLP applications. Low- density languages, such as Irish, therefore lack significant research in this area. This paper describes the early stages in the development of new language resources for Irish – namely the first Irish dependency treebank and the first Irish statistical dependency parser. We present the methodology behind building our new treebank and the steps we take to leverage upon the few existing resources. We discuss language specific choices made when defining our dependency labelling scheme, and describe interesting Irish language characteristics such as prepositional attachment, copula and clefting. We manually develop a small treebank of 300 sentences based on an existing POS-tagged corpus and report an inter-annotator agreement of 0.7902. We train MaltParser to achieve preliminary parsing results for Irish and describe a bootstrapping approach for further stages of development
Wrapper syntax for example-based machine translation
TransBooster is a wrapper technology designed to improve the performance of wide-coverage machine translation
systems. Using linguistically motivated syntactic information, it automatically decomposes source language sentences into shorter and syntactically simpler chunks, and recomposes their translation to form target language sentences. This generally improves both the word order
and lexical selection of the translation. To date, TransBooster has been successfully applied to rule-based MT, statistical MT, and multi-engine MT. This paper presents
the application of TransBooster to Example-Based Machine Translation. In an experiment conducted on test sets
extracted from Europarl and the Penn II Treebank we show that our method can raise the BLEU score up to 3.8% relative
to the EBMT baseline. We also conduct a manual evaluation, showing that TransBooster-enhanced EBMT produces
a better output in terms of fluency than the baseline EBMT in 55% of the cases and in terms of accuracy in 53% of the
cases
Improving the translation environment for professional translators
When using computer-aided translation systems in a typical, professional translation workflow, there are several stages at which there is room for improvement. The SCATE (Smart Computer-Aided Translation Environment) project investigated several of these aspects, both from a human-computer interaction point of view, as well as from a purely technological side.
This paper describes the SCATE research with respect to improved fuzzy matching, parallel treebanks, the integration of translation memories with machine translation, quality estimation, terminology extraction from comparable texts, the use of speech recognition in the translation process, and human computer interaction and interface design for the professional translation environment. For each of these topics, we describe the experiments we performed and the conclusions drawn, providing an overview of the highlights of the entire SCATE project
Evaluating syntax-driven approaches to phrase extraction for MT
In this paper, we examine a number of different phrase segmentation approaches for Machine Translation and how they perform when used to supplement the translation model of a phrase-based SMT system. This work represents a summary of a number of years of research carried out at Dublin City University in which it has been found that improvements can be made using hybrid translation
models. However, the level of improvement achieved is dependent on the amount of training data used. We describe the various approaches to phrase segmentation and combination explored, and outline a series of experiments investigating the relative merits of each method
Joint Morphological and Syntactic Disambiguation
In morphologically rich languages, should morphological and syntactic disambiguation be treated sequentially or as a single problem? We describe several efficient, probabilistically interpretable ways to apply joint inference to morphological and syntactic disambiguation using lattice parsing. Joint inference is shown to compare favorably to pipeline parsing methods across a variety of component models. State-of-the-art performance on Hebrew Treebank parsing is demonstrated using the new method. The benefits of joint inference are modest with the current component models, but appear to increase as components themselves improve
- …