6,842 research outputs found

    Bilingually motivated domain-adapted word segmentation for statistical machine translation

    We introduce a word segmentation approach for languages in which word boundaries are not orthographically marked, with application to Phrase-Based Statistical Machine Translation (PB-SMT). Instead of using manually segmented monolingual domain-specific corpora to train segmenters, we make use of bilingual corpora and statistical word alignment techniques. First, our approach is adapted to the specific translation task at hand by taking the corresponding source (target) language into account. Second, it does not rely on manually segmented training data, so it can be automatically adapted to different domains. We evaluate the performance of our segmentation approach on PB-SMT tasks from two domains and demonstrate that it scores consistently among the best results across different data conditions.

    Transitive probabilistic CLIR models.

    Transitive translation could be a useful technique to enlarge the number of supported language pairs for a cross-language information retrieval (CLIR) system in a cost-effective manner. The paper describes several setups for transitive translation based on probabilistic translation models. The transitive CLIR models were evaluated on the CLEF test collection and yielded a retrieval effectiveness up to 83% of monolingual performance, which is significantly better than a baseline using the synonym operator.

    Seeding statistical machine translation with translation memory output through tree-based structural alignment

    With the steadily increasing demand for high-quality translation, the localisation industry is constantly searching for technologies that would increase translator throughput, with the current focus on the use of high-quality Statistical Machine Translation (SMT) as a supplement to the established Translation Memory (TM) technology. In this paper we present a novel modular approach that utilises state-of-the-art sub-tree alignment to pick out pre-translated segments from a TM match and uses them to seed an SMT system, which produces the final translation. We show that the presented system can outperform pure SMT when a good TM match is found. It can also be used in a Computer-Aided Translation (CAT) environment to present almost perfect translations to the human user, with markup highlighting the segments of the translation that need to be checked manually for correctness.

    Noisy-parallel and comparable corpora filtering methodology for the extraction of bi-lingual equivalent data at sentence level

    Text alignment and text quality are critical to the accuracy of Machine Translation (MT) systems, some NLP tools, and any other text processing tasks requiring bilingual data. This research proposes a language-independent bi-sentence filtering approach based on Polish-to-English experiments (Polish is not a position-sensitive language). This cleaning approach was developed on the TED Talks corpus and also initially tested on the Wikipedia comparable corpus, but it can be used for any text domain or language pair. The proposed approach implements various heuristics for sentence comparison. Some of them leverage synonyms and semantic and structural analysis of text as additional information. Minimization of data loss was ensured. An improvement in MT system score with text processed using the tool is discussed.