Unsupervised generation of parallel treebanks through sub-tree alignment
The need for syntactically annotated data for use in natural language processing has increased dramatically
in recent years. This is true especially for parallel treebanks, of which very few exist. The ones
that exist are mainly hand-crafted and too small for reliable use in data-oriented applications. In this
paper we introduce an open-source system for fast and robust automatic generation of parallel treebanks.
We expect the opening of the presented platform to the scientific community to help boost research
in the field of data-oriented machine translation and lead to advancements in other fields where
parallel treebanks can be employed.
Seeding statistical machine translation with translation memory output through tree-based structural alignment
With the steadily increasing demand for high-quality translation, the localisation industry is constantly searching for technologies that would increase translator
throughput, with the current focus on the use of high-quality Statistical Machine Translation (SMT) as a supplement to the established Translation Memory (TM)
technology. In this paper we present a novel modular approach that utilises state-of-the-art sub-tree alignment to pick out pre-translated segments from a TM match and use them to seed an SMT system that produces the final translation. We show that the presented system can outperform pure SMT when a good TM match is found. It can also be used in a Computer-Aided Translation (CAT) environment to present almost perfect translations to the human user, with markup highlighting the segments of the translation that need to be checked manually for correctness.
Highlighting matched and mismatched segments in translation memory output through sub-tree alignment
In recent years, it has become increasingly clear that the
localisation industry does not have the necessary manpower to satisfy the increasing demand for high-quality translation. This has fuelled the search for new and existing technologies that would increase translator throughput. As Translation Memory (TM) systems are the tool most commonly employed by translators, a number of enhancements are
available to assist them in their job. One such enhancement would be to show the translator which parts of the sentence
that needs to be translated match which parts of the fuzzy
match suggested by the TM. For this information to be used,
however, the translators have to carry it over to the TM
translation themselves. In this paper, we present a novel methodology that can automatically detect and highlight
the segments that need to be modified in a TM-suggested
translation. We base it on state-of-the-art sub-tree alignment
technology (Zhechev, 2010) that can produce aligned
phrase-based-tree pairs from unannotated data. Our system
operates in a three-step process. First, the fuzzy match
selected by the TM and its translation are aligned. This
lets us know which segments of the source-language sentence
correspond to which segments in its translation. In the
second step, the fuzzy match is aligned to the input sentence that is currently being translated. This tells us
which parts of the input sentence are available in the fuzzy
match and which still need to be translated. In the third
step, the fuzzy match is used as an intermediary, through
which the alignments between the input sentence and the TM
translation are established. In this way, we can detect with
precision the segments in the suggested translation that the
translator needs to edit and highlight them appropriately to
set them apart from the segments that are already good translations for parts of the input sentence. Additionally,
we can show the alignments, as detected by our system, between
the input and the translation, which will make it even easier for the translator to post-edit the TM suggestion. This alignment information can additionally be used to pre-translate
the mismatched segments, further reducing the post-editing load.
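The three-step process described in this abstract can be illustrated with a minimal sketch. It is not the authors' system: the abstract relies on sub-tree alignment, whereas the sketch below assumes the two alignments are already given as plain token-index pairs. The names compose and mark_segments and the English-French example sentences are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Set, Tuple

# Assumption: an alignment is a set of (left_index, right_index) token pairs.
# The system in the abstract aligns sub-trees; token links are a simplification.
Alignment = Set[Tuple[int, int]]


def compose(input_to_fuzzy: Alignment, fuzzy_to_target: Alignment) -> Alignment:
    """Step 3: link input tokens to TM-translation tokens via the fuzzy match."""
    by_fuzzy: Dict[int, List[int]] = {}
    for fuzzy_idx, target_idx in fuzzy_to_target:
        by_fuzzy.setdefault(fuzzy_idx, []).append(target_idx)
    return {
        (input_idx, target_idx)
        for input_idx, fuzzy_idx in input_to_fuzzy
        for target_idx in by_fuzzy.get(fuzzy_idx, [])
    }


def mark_segments(target_tokens: List[str],
                  input_to_target: Alignment) -> List[Tuple[str, bool]]:
    """Pair each TM-translation token with True if it is linked to the input."""
    covered = {target_idx for _, target_idx in input_to_target}
    return [(tok, idx in covered) for idx, tok in enumerate(target_tokens)]


# Hypothetical example data.
input_tokens = ["press", "the", "red", "button"]       # sentence to translate
fuzzy_tokens = ["press", "the", "green", "button"]     # fuzzy match from the TM
target_tokens = ["appuyez", "sur", "le", "bouton", "vert"]  # TM translation

# Step 1: fuzzy match aligned to its translation (given here by hand).
fuzzy_to_target = {(0, 0), (0, 1), (1, 2), (3, 3), (2, 4)}
# Step 2: input aligned to the fuzzy match; "red" has no counterpart.
input_to_fuzzy = {(0, 0), (1, 1), (3, 3)}

input_to_target = compose(input_to_fuzzy, fuzzy_to_target)
marks = mark_segments(target_tokens, input_to_target)
needs_editing = [tok for tok, ok in marks if not ok]
```

In this toy case only "vert" ends up in needs_editing, which corresponds to the segment a translator would have to change; everything else is linked to the input through the fuzzy match.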
Building and querying parallel treebanks
This paper describes our work on building a trilingual parallel treebank. We have annotated constituent structure trees from three text genres (a philosophy novel, economy reports and a technical user manual). Our parallel treebank includes word and phrase alignments. The alignment information was manually checked using a graphical tool that allows the annotator to view a pair of trees from parallel sentences. This tool comes with a powerful search facility which surpasses the expressivity of previous popular treebank query engines.
Bootstrapping parallel treebanks
This paper argues for the development of parallel treebanks. It summarizes the work done in this area, reports on experiments for building a Swedish-German treebank, and
describes our approach for reusing resources from one language while annotating another language.
Annotating a Parallel Monolingual Treebank with Semantic Similarity Relations
Proceedings of the Sixth International Workshop on Treebanks and
Linguistic Theories.
Editors: Koenraad De Smedt, Jan Hajič and Sandra Kübler.
NEALT Proceedings Series, Vol. 1 (2007), 85-96.
© 2007 The editors and contributors.
Published by
Northern European Association for Language
Technology (NEALT)
http://omilia.uio.no/nealt .
Electronically published at
Tartu University Library (Estonia)
http://hdl.handle.net/10062/4476
Annotation, exploitation and evaluation of parallel corpora
Exchange between the translation studies and the computational linguistics communities has traditionally not been very intense. Among other things, this is reflected by the different views on parallel corpora. While computational linguistics does not always strictly pay attention to the translation direction (e.g. when translation rules are extracted from (sub)corpora which actually only consist of translations), translation studies are amongst other things concerned with exactly comparing source and target texts (e.g. to draw conclusions on interference and standardization effects). However, there has recently been more exchange between the two fields – especially when it comes to the annotation of parallel corpora. This special issue brings together the different research perspectives. Its contributions show – from both perspectives – how the communities have come to interact in recent years.
Using the Stockholm TreeAligner
Proceedings of the Sixth International Workshop on Treebanks and
Linguistic Theories.
Editors: Koenraad De Smedt, Jan Hajič and Sandra Kübler.
NEALT Proceedings Series, Vol. 1 (2007), 73-78.