

Exploration of Corpus Augmentation Approach for English-Hindi Bidirectional Statistical Machine Translation System

Author: Gupta Deepa
Jaya K.
Publication venue: Institute of Advanced Engineering and Science
Publication date: 01/06/2016

Even though a lot of Statistical Machine Translation (SMT) research is under way for the English-Hindi language pair, no effort has been made to standardize the dataset. Each study uses a different dataset, different parameters, and a different number of sentences during the various phases of translation, resulting in varied translation output. Comparing these models, understanding their results, gaining insight into corpus behavior, and reproducing the results of these studies therefore becomes tedious. This necessitates standardizing the dataset and identifying common parameters for model development. The main contribution of this paper is to discuss an approach to standardize the dataset and to identify the parameters which in combination give the best performance. It also investigates a novel corpus augmentation approach to improve the translation quality of an English-Hindi bidirectional statistical machine translation system. This model works well for scarce resources without incorporating an external parallel corpus of the underlying language. The experiment is carried out using the open-source phrase-based toolkit Moses. The Indian Languages Corpora Initiative (ILCI) Hindi-English tourism corpus is used. With a limited dataset, considerable improvement is achieved using the corpus augmentation approach for the English-Hindi bidirectional SMT system.

Crossref

Institute of Advanced Engineering and Science

Mitigating the problems of SMT using EBMT

Author: Dandapat Sandipan
Publication venue: Dublin City University. School of Computing
Publication date: 01/11/2012

Statistical Machine Translation (SMT) typically has difficulties with less-resourced languages even with homogeneous data. In this thesis we address the application of Example-Based Machine Translation (EBMT) methods to overcome some of these difficulties. We adopt three alternative approaches to tackle these problems, focusing on two poorly-resourced translation tasks (English–Bangla and English–Turkish). First, we adopt a runtime approach to EBMT using proportional analogy. In addition to the translation task, we have tested the EBMT system using proportional analogy for named entity transliteration. Second, we use a compiled approach to EBMT. Finally, we present a novel way of integrating Translation Memory (TM) into an EBMT system. We discuss the development of these three different EBMT systems and the experiments we have performed. In addition, we present an approach to augment the output quality by strategically combining EBMT systems and SMT systems. The hybrid system shows significant improvement for different language pairs. Runtime EBMT systems in general have significant time complexity issues, especially for a large example-base. We explore two methods to address this issue in our system by making the system scalable at runtime for a large example-base (English–French). First, we use a heuristic-based approach. Second, we use an IR-based indexing technique to speed up the time-consuming matching procedure of the EBMT system. The index-based matching procedure substantially improves run-time speed without affecting translation quality.

DCU Online Research Access Service

Developing a Chunk-based Grammar Checker for Translated English Sentences

Author: Lin Nay Yee
Soe Khin Mar
Thein Ni Lar
Publication venue: Institute of Digital Enhancement of Cognitive Processing, Waseda University
Publication date: 01/01/2011

Waseda University Repository

Training Deployable General Domain MT for a Low Resource Language Pair: English–Bangla

Author: Dandapat Sandipan
Lewis William
Publication venue: European Association for Machine Translation
Publication date: 01/01/2018

A large percentage of the world’s population, upwards of 1.5 billion people, speaks a language of the Indian subcontinent. We call these Indic languages here; they comprise languages from both the Indo-European (e.g., Hindi, Bangla, Gujarati) and Dravidian (e.g., Tamil, Telugu, Malayalam) families. A universal characteristic of Indic languages is their complex morphology, which, when combined with the general lack of sufficient quantities of high quality parallel data, can make developing machine translation (MT) for these languages difficult. In this paper, we describe our efforts towards developing general domain English–Bangla MT systems which are deployable to the Web. We initially developed and deployed SMT-based systems, but over time migrated to NMT-based systems. Our initial SMT-based systems had reasonably good BLEU scores; however, using NMT systems, we have gained significant improvement over SMT baselines. This is achieved using a number of ideas to boost the data store and counter data sparsity: crowd translation of intelligently selected monolingual data (throughput enhanced by an IME (Input Method Editor) designed specifically for QWERTY keyboard entry for Devanagari scripted languages), back-translation, different regularization techniques, dataset augmentation and early stopping.

Repositorio Institucional de la Universidad de Alicante

Neural Machine Translation from Bengali Language to English language and vice-versa

Author: Arindam Roy et al.
Publication venue: Auricle Global Society of Education and Research
Publication date: 05/11/2023

Bengali ranks among the ten most widely spoken languages in the world, with about 230 million native speakers. With UNESCO declaring 21st February as International Mother Language Day to commemorate the five Bangladeshi students who laid down their lives for the cause of their mother tongue, Bengali has come to worldwide attention. Though a significant amount of prose and poetry has been written in Bengali and a large number of Bengali newspapers are published daily, technically it is still considered a Low Resource Language (LRL), unlike English or French, which are High Resource Languages (HRL). The reason is not far to seek: corpora in varied domains such as short stories, sports, politics, agriculture, etc. are few in number, and even when they are available, they are small. Machine translation (MT) is difficult to perform in Bengali because parallel corpora between Bengali and other languages are few and far between, and when they are available they suffer from problems of size and quality. This work aims at implementing one state-of-the-art model in Neural Machine Translation (NMT), the self-attention transformer model, to perform translation from English to Bengali and vice versa. Though a couple of research works have been published in recent years on MT from English to Bengali, they are mostly domain specific. This paper does not focus on any specific domain for NMT from English to Bengali and as such may be conceived as more of a general domain NMT from English to Bengali, which is more difficult than domain-specific NMT. Performance evaluation of the model was done using BLEU version-4 vis-à-vis translations of well-known English-Bengali MT systems.

International Journal on Recent and Innovation Trends in Computing and Communication

Knowledge Expansion of a Statistical Machine Translation System using Morphological Resources

Author: EHRMANN MAUD
TURCHI MARCO
Publication venue: Centro de Innovación y Desarrollo Tecnológico en Cómputo, Instituto Politécnico Nacional, Mexico
Publication date: 09/08/2011

The translation capability of a Phrase-Based Statistical Machine Translation (PBSMT) system depends mostly on parallel data, and phrases that are not present in the training data are not correctly translated. This paper describes a method that efficiently expands the existing knowledge of a PBSMT system without adding more parallel data but using external morphological resources. A set of new phrase associations is added to the translation and reordering models; each of them corresponds to a morphological variation of the source phrase, the target phrase, or both phrases of an existing association. New associations are generated using a string similarity score based on morphosyntactic information. We tested our approach on En-Fr and Fr-En translations, and the results showed improvements in performance in terms of automatic scores (BLEU and Meteor) and a reduction of out-of-vocabulary (OOV) words. We believe that our knowledge expansion framework is generic and could be used to add different types of information to the model.

JRC Publications Repository

Venetan to English machine translation: issues and possible solutions

Author: DELMONTE R.
SUHEL JABER
TONELLI SARA
Publication venue: University of Copenhagen, Special Issue
Publication date: 01/01/2011

In this paper we describe a prototype of a Venetan to English translation system developed under the STILVEN project financed by the Regional Authorities of Veneto Region in Italy. The general approach is a statistical one with some preprocessing operations both at training and translation time (orthographic normalization and POS tagging to make use of factored models), which are needed especially to overcome two main problems: the scarcity of Venetan resources (our Venetan-English corpus is made up of only 13,000 sentences, amounting to 128,000 Venetan tokens excluding punctuation) and the diasystemic nature of Venetan, which really represents an ensemble of varieties rather than a single dialect. We will present in detail the problems related to Venetan, our ideas to solve them, their implementation and the results obtained so far.

Archivio Ricerca Ca'Foscari

Archivio istituzionale della ricerca - Università degli Studi di Venezia Ca' Foscari

Evaluation Review on Effectiveness and Security Performances of Text Steganography Technique

Author: Din Roshidi
Mustapha Aida
Utama Sunariya
Publication venue: Institute of Advanced Engineering and Science
Publication date: 01/08/2018

Steganography is one of the categories of information hiding and is implemented to conceal a hidden message so that it cannot be recognized by human vision. This paper focuses on steganography implementation in the text domain, namely text steganography. Text steganography consists of two groups, which are word-rule based and feature-based techniques. This paper analysed these two categories of text steganography based on effectiveness and security evaluation, because effectiveness is critically important in determining whether a technique has the appropriate quality. Meanwhile, security is important because it reflects how well the technique secures the hidden message. The main goal of this paper is to review the evaluation of text steganography techniques developed by previous research efforts in terms of effectiveness and security. It is anticipated that this paper will identify the performance of text steganography based on effectiveness and security measurement.

UUM Repository

IAES journal