87 research outputs found

    Exploration of Corpus Augmentation Approach for English-Hindi Bidirectional Statistical Machine Translation System

    Although a lot of Statistical Machine Translation (SMT) research is under way for the English-Hindi language pair, no effort has been made to standardize the dataset. Each study uses a different dataset, different parameters and a different number of sentences during the various phases of translation, resulting in varied translation output. Comparing these models, understanding their results, gaining insight into corpus behavior and reproducing the reported results therefore becomes tedious. This calls for standardizing the dataset and identifying common parameters for model development. The main contribution of this paper is to discuss an approach to standardize the dataset and to identify the parameters which in combination give the best performance. It also investigates a novel corpus augmentation approach to improve the translation quality of an English-Hindi bidirectional statistical machine translation system. This model works well in scarce-resource settings without incorporating an external parallel corpus for the underlying languages. The experiment is carried out using the open-source phrase-based toolkit Moses, with the Indian Languages Corpora Initiative (ILCI) Hindi-English tourism corpus. With a limited dataset, considerable improvement is achieved using the corpus augmentation approach for the English-Hindi bidirectional SMT system.

    Mitigating the problems of SMT using EBMT

    Statistical Machine Translation (SMT) typically has difficulties with less-resourced languages even with homogeneous data. In this thesis we address the application of Example-Based Machine Translation (EBMT) methods to overcome some of these difficulties. We adopt three alternative approaches to tackle these problems, focusing on two poorly-resourced translation tasks (English–Bangla and English–Turkish). First, we adopt a runtime approach to EBMT using proportional analogy. In addition to the translation task, we have tested the EBMT system using proportional analogy for named entity transliteration. Second, we use a compiled approach to EBMT. Finally, we present a novel way of integrating Translation Memory (TM) into an EBMT system. We discuss the development of these three different EBMT systems and the experiments we have performed. In addition, we present an approach to augment the output quality by strategically combining EBMT systems and SMT systems. The hybrid system shows significant improvement for different language pairs. Runtime EBMT systems in general have significant time complexity issues, especially for a large example-base. We explore two methods to address this issue in our system by making the system scalable at runtime for a large example-base (English–French). First, we use a heuristic-based approach. Second, we use an IR-based indexing technique to speed up the time-consuming matching procedure of the EBMT system. The index-based matching procedure substantially improves run-time speed without affecting translation quality.

    Developing a Chunk-based Grammar Checker for Translated English Sentences


    Training Deployable General Domain MT for a Low Resource Language Pair: English–Bangla

    A large percentage of the world’s population, upwards of 1.5 billion people, speaks a language of the Indian subcontinent. We refer to these here as Indic languages, comprising languages from both the Indo-European (e.g., Hindi, Bangla, Gujarati) and Dravidian (e.g., Tamil, Telugu, Malayalam) families. A universal characteristic of Indic languages is their complex morphology, which, when combined with the general lack of sufficient quantities of high-quality parallel data, can make developing machine translation (MT) for these languages difficult. In this paper, we describe our efforts towards developing general domain English–Bangla MT systems which are deployable to the Web. We initially developed and deployed SMT-based systems, but over time migrated to NMT-based systems. Our initial SMT-based systems had reasonably good BLEU scores; however, using NMT systems, we have gained significant improvement over SMT baselines. This is achieved using a number of ideas to boost the data store and counter data sparsity: crowd translation of intelligently selected monolingual data (with throughput enhanced by an Input Method Editor (IME) designed specifically for QWERTY keyboard entry for Devanagari scripted languages), back-translation, different regularization techniques, dataset augmentation and early stopping.

    Neural Machine Translation from Bengali Language to English language and vice-versa

    Bengali ranks among the ten most spoken languages in the world, with about 230 million native speakers. With UNESCO declaring 21st February as International Mother Language Day to commemorate the laying down of lives by five Bangladeshi students for the cause of their mother tongue, Bengali has come to worldwide attention. Though a significant amount of prose and poetry has been written in Bengali and a large number of Bengali newspapers are published daily, technically it is still considered a Low Resource Language (LRL), unlike English or French, which are High Resource Languages (HRL). The reason is not far to seek: corpora in varied domains such as short stories, sports, politics and agriculture are few in number, and those that are available are small. Machine translation (MT) is difficult to perform in Bengali because parallel corpora between Bengali and other languages are few and far between, and those that are available suffer from problems of size and quality. This work aims to implement a state-of-the-art Neural Machine Translation (NMT) model, the self-attention transformer, to translate from English to Bengali and vice versa. Though a few research works on MT from English to Bengali have been published in recent years, they are mostly domain specific. This paper does not focus on any specific domain and may therefore be regarded as general-domain NMT from English to Bengali, which is more difficult than domain-specific NMT. The model's performance was evaluated using BLEU version-4 vis-à-vis translations of well-known English-Bengali MT systems.

    Knowledge Expansion of a Statistical Machine Translation System using Morphological Resources

    The translation capability of a Phrase-Based Statistical Machine Translation (PBSMT) system depends mostly on parallel data, and phrases that are not present in the training data are not correctly translated. This paper describes a method that efficiently expands the existing knowledge of a PBSMT system without adding more parallel data, using external morphological resources instead. A set of new phrase associations is added to the translation and reordering models; each of them corresponds to a morphological variation of the source phrase, the target phrase, or both phrases of an existing association. New associations are generated using a string similarity score based on morphosyntactic information. We tested our approach on En-Fr and Fr-En translations, and the results showed improvements in performance in terms of automatic scores (BLEU and Meteor) and a reduction of out-of-vocabulary (OOV) words. We believe that our knowledge expansion framework is generic and could be used to add different types of information to the model.

    Venetan to English machine translation: issues and possible solutions

    In this paper we describe a prototype of a Venetan to English translation system developed under the STILVEN project, financed by the Regional Authorities of the Veneto Region in Italy. The general approach is a statistical one with some preprocessing operations at both training and translation time (orthographic normalization and POS tagging to make use of factored models). These are needed especially to overcome two main problems: the scarcity of Venetan resources (our Venetan-English corpus is made up of only 13,000 sentences, amounting to 128,000 Venetan tokens excluding punctuation) and the diasystemic nature of Venetan, which really represents an ensemble of varieties rather than a single dialect. We will present in detail the problems related to Venetan, our ideas to solve them, their implementation and the results obtained so far.

    Evaluation Review on Effectiveness and Security Performances of Text Steganography Technique

    Steganography is one of the categories of information hiding, implemented to conceal a hidden message so that it cannot be recognized by human vision. This paper focuses on steganography implementation in the text domain, namely text steganography. Text steganography consists of two groups: word-rule based and feature-based techniques. This paper analysed these two categories of text steganography based on effectiveness and security evaluation, because effectiveness is critically important in determining whether a technique has the appropriate quality. Meanwhile, security is important because it measures how strongly a technique protects the hidden message. The main goal of this paper is to review the evaluation of text steganography in terms of effectiveness and security as developed by previous research efforts. It is anticipated that this paper will identify the performance of text steganography based on effectiveness and security measurement.