HindEnCorp – Hindi-English and Hindi-only Corpus for Machine Translation
We present HindEnCorp, a parallel corpus of Hindi and English, and HindMonoCorp, a monolingual corpus of Hindi, in their release version 0.5. Both corpora were collected from web sources and preprocessed primarily for training statistical machine translation systems. HindEnCorp consists of 274k parallel sentences (3.9 million Hindi and 3.8 million English tokens). HindMonoCorp amounts to 787 million tokens in 44 million sentences. Both corpora are freely available for non-commercial research, and their preliminary release was used by numerous participants in the WMT 2014 shared translation task.
Development of Focused Crawlers for Building Large Punjabi News Corpus
Web crawlers are as old as the Internet and are most commonly used by search engines to visit websites and index them into repositories. They are not limited to search engines, however: they are also widely used to build corpora in different domains and languages. This study developed a set of focused web crawlers for three Punjabi news websites. The crawlers were designed to extract high-quality article text and add it to a local repository for use in further research. They were implemented in the Python programming language and were used to construct a corpus of more than 134,000 news articles across nine news genres. The crawler code and extracted corpora were made publicly available to the scientific community for research purposes.
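The article-extraction step of such a focused crawler can be sketched as follows. This is a minimal illustration, not the authors' code: the assumed page layout (title in `<h1>`, body text in `<p>` tags) is hypothetical, and a real crawler would need site-specific parsing rules for each of the three news sites.

```python
# Minimal sketch of the article-extraction step of a focused news crawler.
# Assumes (hypothetically) that the title sits in the first <h1> and the
# body in <p> tags; real sites need site-specific rules.
from html.parser import HTMLParser


class ArticleExtractor(HTMLParser):
    """Collects the first <h1> as the title and all <p> text as the body."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.paragraphs = []
        self._in_h1 = False
        self._in_p = False

    def handle_starttag(self, tag, attrs):
        if tag == "h1" and not self.title:
            self._in_h1 = True
        elif tag == "p":
            self._in_p = True
            self.paragraphs.append("")  # start a new body paragraph

    def handle_endtag(self, tag):
        if tag == "h1":
            self._in_h1 = False
        elif tag == "p":
            self._in_p = False

    def handle_data(self, data):
        if self._in_h1:
            self.title += data
        elif self._in_p and self.paragraphs:
            self.paragraphs[-1] += data


def extract_article(html):
    """Return (title, body_text) extracted from one article page."""
    parser = ArticleExtractor()
    parser.feed(html)
    body = "\n".join(p.strip() for p in parser.paragraphs if p.strip())
    return parser.title.strip(), body
```

A crawler would fetch each article URL, run `extract_article` on the response, and append the result to the local repository; the fetching and genre-labelling logic is omitted here.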
The Effect of Iconicity Flash Blindness—An Empirical Study
In our experiment, the Saussurean postulate of arbitrariness was empirically tested to see whether it applies to all words to the same extent. Three hundred participants were asked to match Czech words with their Hindi translations. One set of words was randomly chosen from a Hindi corpus (set A); the second set consisted of both randomly chosen words and words categorized as ideophones (set B). The participants were successful in matching both sets (the lower bound of the confidence interval lies about 7% above random guessing), and their performance showed unexpected patterns. First, not only iconic properties (the sound qualities) but iconicity itself is an important distinctive feature, and recipients are able to exploit it. Moreover, even words considered non-iconic (set A) apparently contain a degree of iconicity, which participants are able to draw upon. However, participants appear to lose this ability when non-iconic words are presented in the context of words with evident and abundant iconicity (set B). The effect resembles the accommodation process known for other senses; we therefore call it “iconicity flash blindness”.
Multimodal neural machine translation for low-resource language pairs using synthetic data
In this paper, we investigate the effectiveness of training a multimodal neural machine translation (MNMT) system with image features for a low-resource language pair, Hindi and English, using synthetic data. A three-way parallel corpus containing bilingual texts and corresponding images is required to train an MNMT system with image features; however, such a corpus is not available for low-resource language pairs. To address this, we developed both a synthetic training dataset and a manually curated development/test dataset for Hindi based on an existing English-image parallel corpus. We used these datasets to build our image description translation system by adopting state-of-the-art MNMT models. Our results show that it is possible to train an MNMT system for low-resource language pairs through the use of synthetic data and that such a system can benefit from image features.
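The assembly of the synthetic three-way corpus described above can be sketched as follows. This is an illustrative reading of the abstract, not the authors' pipeline: `translate_en_to_hi` is a hypothetical stand-in for whatever MT system produces the synthetic Hindi side.

```python
# Sketch: build a three-way (image, English, synthetic Hindi) training corpus
# from an existing English-image parallel corpus by machine-translating the
# English captions. `translate_en_to_hi` is a hypothetical MT callable.
def build_synthetic_corpus(en_image_pairs, translate_en_to_hi):
    """en_image_pairs: iterable of (image_id, english_caption) tuples.

    Returns a list of (image_id, english_caption, synthetic_hindi_caption)
    triples suitable as MNMT training data.
    """
    return [
        (image_id, caption, translate_en_to_hi(caption))
        for image_id, caption in en_image_pairs
    ]
```

The manually curated development/test sets would replace the synthetic Hindi side with human translations, so that evaluation is not contaminated by MT errors in the synthetic data.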
Temporality as seen through translation: a case study on Hindi texts
Temporality has contributed significantly to various Natural Language Processing applications. In this paper, we determine the extent to which temporal orientation is preserved when a sentence is translated, manually and automatically, from Hindi to English. We show that the temporal orientation identified (both manually and automatically) in the English translations, whether manual or automatic, matches the temporal orientation of the Hindi texts well. We also find that manual temporal annotation becomes difficult in the translated texts, while the automatic temporal processing system manages to correctly capture temporal information from the translations.
TermEval: an automatic metric for evaluating terminology translation in MT
Terminology translation plays a crucial role in domain-specific machine translation (MT). Preservation of domain knowledge from source to target is arguably the factor of greatest concern to customers in the translation industry, especially in critical domains such as medicine, transportation, the military, law, and aerospace. However, despite its importance to the translation industry, the evaluation of terminology translation has been a less examined area of MT research. Term translation quality in MT is usually measured with domain experts, in either academia or industry. To the best of our knowledge, there is as yet no publicly available solution for automatically evaluating terminology translation in MT. In particular, manual intervention is often needed, which is by nature a time-consuming and highly expensive task. This is impractical in an industrial setting, where customised MT systems often need to be updated for many reasons (e.g. the availability of new training data or leading MT techniques). Hence, there is a genuine need for a faster and less expensive solution that could help end-users instantly identify term translation problems in MT.
In this study, we propose an automatic evaluation metric, TermEval, for evaluating terminology translation in MT. To the best of our knowledge, there is no gold-standard dataset available for measuring terminology translation quality in MT; in its absence, we semi-automatically create one from an English–Hindi judicial-domain parallel corpus.
We trained state-of-the-art phrase-based SMT (PB-SMT) and neural MT (NMT) models in two translation directions, English-to-Hindi and Hindi-to-English, and used TermEval to evaluate their terminology translation performance on the created gold-standard test set. To measure the correlation between TermEval scores and human judgements, the translation of each source term in the gold-standard test set was validated by a human evaluator. The high correlation between TermEval and human judgements demonstrates the effectiveness of the proposed metric. We also carry out a comprehensive manual evaluation of terminology translation and present our observations.
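A term-translation accuracy score in the spirit of the approach above can be sketched as follows. This is a simplified proxy, not the TermEval formulation from the paper, which the abstract does not fully specify: it scores each MT output by whether the gold target term from the test set appears in it.

```python
# Hedged sketch of a term-translation accuracy score (a simplified proxy,
# not the paper's exact TermEval metric): the fraction of gold target-side
# terms that appear in the corresponding MT outputs.
def term_translation_score(hypotheses, gold_terms):
    """hypotheses: list of MT output sentences (one per test segment).
    gold_terms: list of expected target-language terms, one per segment.

    Returns the fraction of segments whose output contains the gold term.
    """
    if not gold_terms:
        return 0.0
    hits = sum(term in hyp for hyp, term in zip(hypotheses, gold_terms))
    return hits / len(gold_terms)
```

A real metric would additionally handle morphological variants of a term and multiple valid target terms per source term, which simple substring matching misses.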