Formemes in English-Czech Deep Syntactic MT
One of the most notable recent improvements of the TectoMT English-to-Czech translation is a systematic and theoretically supported revision of formemes, the annotation of morpho-syntactic features of content words in deep dependency syntactic structures based on the Prague tectogrammatics theory. Our modifications aim at reducing data sparsity, increasing consistency across languages and widening the usage area of this markup. Formemes can be used not only in MT, but also in various other NLP tasks.
Towards English-to-Czech MT via Tectogrammatical Layer
Proceedings of the Sixth International Workshop on Treebanks and Linguistic Theories. Editors: Koenraad De Smedt, Jan Hajič and Sandra Kübler. NEALT Proceedings Series, Vol. 1 (2007), 7-18. © 2007 The editors and contributors.
Published by Northern European Association for Language Technology (NEALT), http://omilia.uio.no/nealt
Electronically published at Tartu University Library (Estonia), http://hdl.handle.net/10062/4476
Translation of "It" in a Deep Syntax Framework
We present a novel approach to the translation of the English personal pronoun it to Czech. We conduct a linguistic analysis of how the distinct categories of it are usually mapped to their Czech counterparts. Armed with these observations, we design a discriminative translation model of it, which is then integrated into the TectoMT deep syntax MT framework. Features in the model take advantage of the rich syntactic annotation that TectoMT is based on, external tools for anaphoricity resolution, lexical co-occurrence frequencies measured on a large parallel corpus and gold coreference annotation. Even though the new model for it exhibits no improvement in terms of BLEU, manual evaluation shows that it outperforms the original solution in 8.5% of sentences containing it.
SMT and Hybrid systems of the QTLeap project in the WMT16 IT-task
This paper describes 12 systems submitted to the WMT16 IT-task, covering six different languages, namely Basque, Bulgarian, Dutch, Czech, Portuguese and Spanish. All these systems were developed under the scope of the QTLeap project and follow a common strategy. For each language two different systems were submitted, namely a phrase-based MT system built using Moses, and a system exploiting deep language engineering approaches, which for all the languages but Bulgarian was implemented using TectoMT. For 4 of the 6 languages, the TectoMT-based system performs better than the Moses-based one.
Syntax and Rich Morphology in MT
The talk describes in detail the issues specific to English-to-Czech MT: sentence syntax and target-side rich morphology.
Treex: Modular NLP Framework
Treex is a multi-purpose open-source natural language processing (NLP) framework implemented in the Perl programming language. It allows for fast and efficient development of NLP applications by exploiting a wide range of software modules already integrated in Treex, such as tools for sentence segmentation, tokenization, morphological analysis, POS tagging, shallow and deep syntax parsing, named entity recognition, anaphora resolution, tree-to-tree translation, natural language generation, word-level alignment of parallel corpora, and other tasks. One of the most complex applications of Treex is the English-Czech machine translation system TectoMT. Several modules are also available for other languages (German, Russian, Arabic). Where possible, modules are implemented in a language-independent way, so they can be reused in many applications.
New Language Pairs in TectoMT
The TectoMT tree-to-tree machine translation system has been updated this year to support easier retraining for more translation directions. We use multilingual standards for morphology and syntax annotation and language-independent base rules. We include a simple, non-parametric way of combining TectoMT’s transfer model outputs.
Using Parallel Texts and Lexicons for Verbal Word Sense Disambiguation
We present a system for verbal Word Sense Disambiguation (WSD) that is able to exploit additional information from parallel texts and lexicons. It is an extension of our previous WSD method, which gave promising results but used only monolingual features. In the follow-up work described here, we have explored two additional ideas: using English-Czech bilingual resources (as features only; the task itself remains a monolingual WSD task), and using a 'hybrid' approach, adding features extracted both from a parallel corpus and from manually aligned bilingual valency lexicon entries, which contain subcategorization information. Although not all types of features proved useful, both ideas have led to significant improvements for both languages explored.