793 research outputs found
Unsupervised generation of parallel treebanks through sub-tree alignment
The need for syntactically annotated data for use in natural language processing has increased dramatically
in recent years. This is true especially for parallel treebanks, of which very few exist. The ones
that exist are mainly hand-crafted and too small for reliable use in data-oriented applications. In this
paper we introduce an open-source system for fast and robust automatic generation of parallel treebanks.
We expect the opening of the presented platform to the scientific community to help boost research
in the field of data-oriented machine translation and lead to advancements in other fields where
parallel treebanks can be employed
Improving the translation environment for professional translators
When using computer-aided translation systems in a typical, professional translation workflow, there are several stages at which there is room for improvement. The SCATE (Smart Computer-Aided Translation Environment) project investigated several of these aspects, both from a human-computer interaction point of view, as well as from a purely technological side.
This paper describes the SCATE research with respect to improved fuzzy matching, parallel treebanks, the integration of translation memories with machine translation, quality estimation, terminology extraction from comparable texts, the use of speech recognition in the translation process, and human computer interaction and interface design for the professional translation environment. For each of these topics, we describe the experiments we performed and the conclusions drawn, providing an overview of the highlights of the entire SCATE project
One model, two languages: training bilingual parsers with harmonized treebanks
We introduce an approach to train lexicalized parsers using bilingual corpora
obtained by merging harmonized treebanks of different languages, producing
parsers that can analyze sentences in either of the learned languages, or even
sentences that mix both. We test the approach on the Universal Dependency
Treebanks, training with MaltParser and MaltOptimizer. The results show that
these bilingual parsers are more than competitive, as most combinations not
only preserve accuracy, but some even achieve significant improvements over the
corresponding monolingual parsers. Preliminary experiments also show the
approach to be promising on texts with code-switching and when more languages
are added.Comment: 7 pages, 4 tables, 1 figur
Comparing constituency and dependency representations for SMT phrase-extraction
We consider the value of replacing and/or combining string-based methods with syntax-based methods for phrase-based statistical machine translation (PBSMT),
and we also consider the relative merits of using constituency-annotated vs. dependency-annotated training data. We automatically derive two subtree-aligned treebanks,
dependency-based and constituency-based, from a parallel EnglishâFrench corpus and extract syntactically motivated word- and phrase-pairs. We automatically measure PB-SMT quality. The results show that combining string-based and syntax-based word- and phrase-pairs can improve translation quality irrespective of the type of syntactic annotation. Furthermore, using dependency annotation yields greater translation quality than constituency annotation for PB-SMT
An Integrated Framework for Treebanks and Multilayer Annotations
Treebank formats and associated software tools are proliferating rapidly,
with little consideration for interoperability. We survey a wide variety of
treebank structures and operations, and show how they can be mapped onto the
annotation graph model, and leading to an integrated framework encompassing
tree and non-tree annotations alike. This development opens up new
possibilities for managing and exploiting multilayer annotations.Comment: 8 page
Using the Stockholm TreeAligner
Proceedings of the Sixth International Workshop on Treebanks and
Linguistic Theories.
Editors: Koenraad De Smedt, Jan HajiÄ and Sandra KĂŒbler.
NEALT Proceedings Series, Vol. 1 (2007), 73-78.
© 2007 The editors and contributors.
Published by
Northern European Association for Language
Technology (NEALT)
http://omilia.uio.no/nealt .
Electronically published at
Tartu University Library (Estonia)
http://hdl.handle.net/10062/4476
Bootstrapping parallel treebanks
This paper argues for the development of parallel treebanks. It summarizes the work done in this area and reports on experiments for building a Swedish-German treebank. And it
describes our approach for reusing resources from one language while annotating another language
- âŠ