87 research outputs found
Exploration of Corpus Augmentation Approach for English-Hindi Bidirectional Statistical Machine Translation System
Although a lot of Statistical Machine Translation (SMT) research is under way for the English-Hindi language pair, no effort has been made to standardize the dataset. Each study uses a different dataset, different parameters and a different number of sentences during the various phases of translation, resulting in varied translation output. Comparing these models, understanding their results, gaining insight into corpus behavior and reproducing the results of these studies therefore becomes tedious. This necessitates standardizing the dataset and identifying common parameters for model development. The main contribution of this paper is to discuss an approach to standardize the dataset and to identify the parameters which in combination give the best performance. It also investigates a novel corpus augmentation approach to improve the translation quality of an English-Hindi bidirectional statistical machine translation system. This model works well for scarce resources without incorporating an external parallel corpus of the underlying language. The experiment is carried out using the open-source phrase-based toolkit Moses. The Indian Languages Corpora Initiative (ILCI) Hindi-English tourism corpus is used. With a limited dataset, considerable improvement is achieved using the corpus augmentation approach for the English-Hindi bidirectional SMT system.
Mitigating the problems of SMT using EBMT
Statistical Machine Translation (SMT) typically has difficulties with less-resourced languages even with homogeneous data. In this thesis we address the application of Example-Based Machine Translation (EBMT) methods to overcome some of these difficulties. We adopt three alternative approaches to tackle these problems focusing on two poorly-resourced translation tasks (English–Bangla and English–Turkish). First, we adopt a runtime approach to EBMT using proportional analogy. In addition to the translation task, we have tested the EBMT system using proportional analogy for named entity transliteration. In the second attempt, we use a compiled approach to EBMT. Finally, we present a novel way of integrating Translation Memory (TM) into an EBMT system. We discuss the development of these three different EBMT systems and the experiments we have performed. In addition, we present an approach to augment the output quality by strategically combining EBMT systems and SMT systems. The hybrid system shows significant improvement for different language pairs.
Runtime EBMT systems in general have significant time complexity issues, especially for a large example-base. We explore two methods to address this issue in our system by making it scalable at runtime for a large example-base (English–French). First, we use a heuristic-based approach. Second, we use an IR-based indexing technique to speed up the time-consuming matching procedure of the EBMT system. The index-based matching procedure substantially improves run-time speed without affecting translation quality.
Training Deployable General Domain MT for a Low Resource Language Pair: English–Bangla
A large percentage of the world’s population, upwards of 1.5 billion people, speaks a language of the Indian subcontinent. We call these Indic languages here; they comprise languages from both the Indo-European (e.g., Hindi, Bangla, Gujarati) and Dravidian (e.g., Tamil, Telugu, Malayalam) families. A universal characteristic of Indic languages is their complex morphology, which, when combined with the general lack of sufficient quantities of high-quality parallel data, can make developing machine translation (MT) for these languages difficult. In this paper, we describe our efforts towards developing general domain English–Bangla MT systems which are deployable to the Web. We initially developed and deployed SMT-based systems, but over time migrated to NMT-based systems. Our initial SMT-based systems had reasonably good BLEU scores; however, using NMT systems, we have gained significant improvement over the SMT baselines. This is achieved using a number of ideas to boost the data store and counter data sparsity: crowd translation of intelligently selected monolingual data (throughput enhanced by an Input Method Editor (IME) designed specifically for QWERTY keyboard entry for Devanagari-scripted languages), back-translation, different regularization techniques, dataset augmentation and early stopping.
Neural Machine Translation from Bengali Language to English language and vice-versa
Bengali ranks among the ten most spoken languages in the world, with about 230 million native speakers. With UNESCO declaring 21st February as International Mother Language Day to commemorate the laying down of lives by five Bangladeshi students for the cause of their mother tongue, Bengali has come onto the radar of worldwide attention. Though a significant amount of prose and poetry has been written in Bengali and a large number of Bengali newspapers are published daily, it is technically still considered a Low Resource Language (LRL), unlike English or French, which are High Resource Languages (HRL). The reason is not far to seek: corpora in varied domains such as short stories, sports, politics and agriculture are few in number and, even when available, small in size. Machine translation (MT) is difficult for Bengali because parallel corpora between Bengali and other languages are few and far between, and those that are available suffer from problems of size and quality. This work aims to implement a state-of-the-art Neural Machine Translation (NMT) model, the self-attention transformer, to translate from English to Bengali and vice versa. Though a couple of research works on MT from English to Bengali have been published in recent years, they are mostly domain specific. This paper does not focus on any specific domain for NMT from English to Bengali and as such may be regarded as general-domain NMT, which is more difficult than domain-specific NMT. Performance evaluation of the model was done using BLEU version-4 vis-à-vis translations of well known English-Bengali MT systems.
Knowledge Expansion of a Statistical Machine Translation System using Morphological Resources
The translation capability of a Phrase-Based Statistical Machine Translation (PBSMT) system mostly depends on parallel data, and phrases that are not present in the training data are not correctly translated. This paper describes a method that efficiently expands the existing knowledge of a PBSMT system without adding more parallel data but using external morphological resources. A set of new phrase associations is added to the translation and reordering models; each of them corresponds to a morphological variation of the source phrase, the target phrase, or both phrases of an existing association. New associations are generated using a string similarity score based on morphosyntactic information. We tested our approach on En-Fr and Fr-En translations, and the results showed improved performance in terms of automatic scores (BLEU and Meteor) and a reduction of out-of-vocabulary (OOV) words. We believe that our knowledge expansion framework is generic and could be used to add different types of information to the model.
Venetan to English machine translation: issues and possible solutions
In this paper we describe a prototype of a Venetan to English translation system developed under the STILVEN project financed by the Regional Authorities of Veneto Region in Italy. The general approach is a statistical one with some preprocessing operations both at training and translation time (orthographic normalization and POS tagging to make use of factored models) which are needed especially to overcome two main problems: the scarcity of Venetan resources (our Venetan-English corpus is made up of only 13,000 sentences, amounting to 128,000 Venetan tokens excluding punctuation) and the diasystemic nature of Venetan, which really represents an ensemble of varieties rather than a single dialect. We will present in detail the problems related to Venetan, our ideas to solve them, their implementation and the results obtained so far.
Evaluation Review on Effectiveness and Security Performances of Text Steganography Technique
Steganography is one of the categories of information hiding; it is implemented to conceal a hidden message so that it cannot be recognized by human vision. This paper focuses on steganography implementation in the text domain, namely text steganography. Text steganography consists of two groups, which are word-rule based and feature-based techniques. This paper analyses these two categories of text steganography based on effectiveness and security evaluation, because effectiveness is critically important in determining whether a technique has appropriate quality. Security, meanwhile, matters because it reflects how well the technique secures the hidden message. The main goal of this paper is to review the evaluation of text steganography in terms of effectiveness and security as developed by previous research efforts. It is anticipated that this paper will identify the performance of text steganography based on effectiveness and security measurement.