Bilingually motivated domain-adapted word segmentation for statistical machine translation
We introduce a word segmentation approach for languages where word boundaries are not orthographically marked, with application to Phrase-Based Statistical Machine Translation (PB-SMT). Instead of using manually segmented monolingual domain-specific corpora to train segmenters, we make use of bilingual corpora and statistical word alignment techniques. First, our approach is adapted to the specific translation task at hand by taking the corresponding source (target) language into account. Second, it does not rely on manually segmented training data, so it can be automatically adapted to different domains. We evaluate the performance of our segmentation approach on PB-SMT tasks from two domains and demonstrate that it scores consistently among the best results across different data conditions.
Transitive probabilistic CLIR models
Transitive translation could be a useful technique to enlarge the number of supported language pairs for a cross-language information retrieval (CLIR) system in a cost-effective manner. The paper describes several setups for transitive translation based on probabilistic translation models. The transitive CLIR models were evaluated on the CLEF test collection and yielded a retrieval effectiveness of up to 83% of monolingual performance, which is significantly better than a baseline using the synonym operator.
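As a rough illustration of what transitive translation with probabilistic models can look like, the sketch below composes two word translation tables through a pivot language by marginalising over pivot words, P(t|s) = Σ_p P(p|s) · P(t|p). This is only one common formulation and is an assumption here; the abstract does not say which setups the paper uses. The function name `compose_via_pivot`, the table layout, and all probabilities are hypothetical.

```python
from collections import defaultdict


def compose_via_pivot(src_to_pivot, pivot_to_tgt):
    """Compose two translation tables through a pivot language.

    Each table maps a word to {translation: probability}.
    Returns P(t|s) = sum over pivot words p of P(p|s) * P(t|p).
    (Hypothetical helper; not taken from the paper.)
    """
    src_to_tgt = {}
    for s, pivots in src_to_pivot.items():
        scores = defaultdict(float)
        for p, prob_p_given_s in pivots.items():
            # Pivot words with no entry in the second table contribute nothing.
            for t, prob_t_given_p in pivot_to_tgt.get(p, {}).items():
                scores[t] += prob_p_given_s * prob_t_given_p
        src_to_tgt[s] = dict(scores)
    return src_to_tgt


# Made-up toy probabilities, for illustration only.
src_pivot = {"bank": {"bank": 0.6, "couch": 0.4}}
pivot_tgt = {"bank": {"Bank": 0.9, "Ufer": 0.1}, "couch": {"Sofa": 1.0}}

src_tgt = compose_via_pivot(src_pivot, pivot_tgt)
for target, prob in sorted(src_tgt["bank"].items(), key=lambda kv: -kv[1]):
    print(f"{target}\t{prob:.2f}")
```

In a retrieval setting, the composed table could then be used to weight target-language query terms, although ambiguity accumulates across the two translation steps.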
Seeding statistical machine translation with translation memory output through tree-based structural alignment
With the steadily increasing demand for high-quality translation, the localisation industry is constantly searching for technologies that would increase translator throughput, with the current focus on the use of high-quality Statistical Machine Translation (SMT) as a supplement to the established Translation Memory (TM) technology. In this paper we present a novel modular approach that utilises state-of-the-art sub-tree alignment to pick out pre-translated segments from a TM match and uses them to seed an SMT system, which produces the final translation. We show that the presented system can outperform pure SMT when a good TM match is found. It can also be used in a Computer-Aided Translation (CAT) environment to present almost perfect translations to the human user, with markup highlighting the segments of the translation that need to be checked manually for correctness.
Noisy-parallel and comparable corpora filtering methodology for the extraction of bi-lingual equivalent data at sentence level
Text alignment and text quality are critical to the accuracy of Machine
Translation (MT) systems, some NLP tools, and any other text processing tasks
requiring bilingual data. This research proposes a language-independent
bi-sentence filtering approach based on Polish-to-English experiments (Polish
is not a position-sensitive language). This cleaning approach was developed on the
TED Talks corpus and also initially tested on the Wikipedia comparable corpus,
but it can be used for any text domain or language pair. The proposed approach
implements various heuristics for sentence comparison. Some of them leverage
synonyms and semantic and structural analysis of text as additional
information. Minimization of data loss was ensured. An improvement in MT system
score with text processed using the tool is discussed.
Comment: arXiv admin note: text overlap with arXiv:1509.09093, arXiv:1509.0888