Comparison of Different Orthographies for Machine Translation of Under-Resourced Dravidian Languages

Author: Arcan Mihael
Chakravarthi Bharathi Raja
McCrae John P.
Publication venue: OASIcs - OpenAccess Series in Informatics. 2nd Conference on Language, Data and Knowledge (LDK 2019)
Publication date: 01/01/2019

Under-resourced languages are a significant challenge for statistical approaches to machine translation, and recently it has been shown that the usage of training data from closely-related languages can improve machine translation quality of these languages. While languages within the same language family share many properties, many under-resourced languages are written in their own native script, which makes taking advantage of these language similarities difficult. In this paper, we propose to alleviate the problem of different scripts by transcribing the native script into a common representation, i.e. the Latin script or the International Phonetic Alphabet (IPA). In particular, we compare coarse-grained transliteration to the Latin script with fine-grained IPA transliteration. We performed experiments on the English-Tamil, English-Telugu, and English-Kannada translation tasks. Our results show improvements in terms of the BLEU, METEOR and chrF scores from transliteration, and we find that transliteration into the Latin script outperforms the fine-grained IPA transcription.

ZENODO

Dagstuhl Research Online Publication Server

NEUROSURGERY ENTHUSIASTIC WOMEN SOCIETY

Improving Machine Translation Quality with Denoising Autoencoder and Pre-Ordering

Author: Hoang-Quan Nguyen
Hong-Viet Tran
Van-Vinh Nguyen
Publication venue: Faculty of Electrical Engineering and Computing, Univ. of Zagreb
Publication date: 01/01/2021

The problems in machine translation are related to the characteristics of a family of languages, especially syntactic divergences between languages. In the translation task, having both source and target languages in the same language family is a luxury that cannot be relied upon. The trained models for the task must overcome such differences either through manual augmentations or through automatically inferred capacity built into the model design. In this work, we investigated the impact of multiple methods for handling differing word orders during translation and further experimented with assimilating the source language's syntax to the target word order using pre-ordering. We focused on extremely low-resource scenarios. We also conducted experiments on practical data augmentation techniques that support the reordering capacity of the models by varying the target objectives, adding the secondary goal of removing noise or reordering broken input sequences. In particular, we propose methods to improve translation quality with the denoising autoencoder in Neural Machine Translation (NMT) and the pre-ordering method in Phrase-based Statistical Machine Translation (PBSMT). The experiments with a number of English-Vietnamese pairs show improvements in BLEU scores compared to both the NMT and SMT systems.
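To make the denoising objective mentioned in this abstract concrete, here is a minimal sketch of how a noisy input could be generated from a clean sentence. It assumes a commonly used noise model (random word dropout plus a bounded local shuffle); the abstract does not specify the noise function, so `add_noise`, `drop_prob`, and `max_shift` are hypothetical names and values chosen for illustration only.

```python
import random


def add_noise(tokens, drop_prob=0.1, max_shift=3.0, rng=None):
    """Return a corrupted copy of `tokens` (hypothetical noise model).

    Each token is dropped with probability `drop_prob`; the survivors are
    then shuffled locally by sorting on (index + random offset), where the
    offset is drawn from [0, max_shift].
    """
    if rng is None:
        rng = random.Random()

    # Word dropout: keep a token when the draw is at or above drop_prob.
    kept = [t for t in tokens if rng.random() >= drop_prob]
    if not kept and tokens:
        kept = [tokens[0]]  # never return an empty sequence for non-empty input

    # Local shuffle: a small random offset perturbs each position before sorting.
    keys = [i + rng.uniform(0.0, max_shift) for i in range(len(kept))]
    order = sorted(range(len(kept)), key=lambda i: keys[i])
    return [kept[i] for i in order]


# A denoising training pair is (noisy input, clean target).
clean = "we investigate word order in translation".split()
noisy = add_noise(clean, rng=random.Random(0))
training_pair = (noisy, clean)
```

A model trained on such pairs has to restore both the missing words and the original order, which is one way to encourage the reordering capacity the abstract describes.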

HRČAK - Portal of Croatian Scientific and Professional Journals

Hrčak - Portal of scientific journals of Croatia

An Empirical Analysis of NMT-Derived Interlingual Embeddings and their Use in Parallel Sentence Identification

Author: Barrón-Cedeño Alberto
España-Bonet Cristina
van Genabith Josef
Varga Ádám Csaba
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 15/11/2017

End-to-end neural machine translation has overtaken statistical machine translation in terms of translation quality for some language pairs, especially those with large amounts of parallel data. Besides this palpable improvement, neural networks provide several new properties. A single system can be trained to translate between many languages at almost no additional cost other than training time. Furthermore, internal representations learned by the network serve as a new semantic representation of words (or sentences) which, unlike standard word embeddings, are learned in an essentially bilingual or even multilingual context. In view of these properties, the contribution of the present work is two-fold. First, we systematically study the NMT context vectors, i.e. the output of the encoder, and their power as an interlingua representation of a sentence. We assess their quality and effectiveness by measuring similarities across translations, as well as across semantically related and semantically unrelated sentence pairs. Second, as an extrinsic evaluation of the first point, we identify parallel sentences in comparable corpora, obtaining F1=98.2% on data from a shared task when using only NMT context vectors. Using context vectors jointly with similarity measures, F1 reaches 98.9%.
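The idea of identifying parallel sentences from sentence-level vectors can be sketched as follows. This is an illustrative example only: it assumes cosine similarity as the similarity measure and a fixed decision threshold, neither of which is stated in the abstract, and it uses toy three-dimensional vectors in place of real NMT encoder outputs. The names `cosine`, `find_parallel_pairs`, and `threshold` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def find_parallel_pairs(src_vecs, tgt_vecs, threshold=0.9):
    """For each source vector, pick the most similar target vector and
    keep the pair only if its similarity reaches `threshold`."""
    pairs = []
    for i, u in enumerate(src_vecs):
        best_j, best_sim = None, -1.0
        for j, v in enumerate(tgt_vecs):
            sim = cosine(u, v)
            if sim > best_sim:
                best_j, best_sim = j, sim
        if best_j is not None and best_sim >= threshold:
            pairs.append((i, best_j, best_sim))
    return pairs


# Toy stand-ins for sentence representations (not real encoder outputs).
src = [[1.0, 0.0, 0.2], [0.0, 1.0, 0.0]]
tgt = [[0.1, 0.9, 0.0], [0.9, 0.1, 0.2], [0.5, 0.5, 0.5]]
pairs = find_parallel_pairs(src, tgt)
```

In this toy data, source sentence 0 aligns with target sentence 1 and source sentence 1 with target sentence 0; the third target vector is left unmatched.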

arXiv.org e-Print Archive

Archivio istituzionale della ricerca - Alma Mater Studiorum Università di Bologna

Translating Arabic as low resource language using distribution representation and neural machine translation models

Author: Almansor Ebtesam Hussain
Publication date: 01/01/2018

University of Technology Sydney. Faculty of Engineering and Information Technology. Rapid growth in social media platforms has made communication between users easier, which in turn has increased the importance of translating human languages. Machine translation technology has been widely used for translating several languages using different approaches such as rule-based translation, statistical machine translation and, more recently, neural machine translation. The quality of machine translation depends on the availability of parallel datasets. Languages that lack sufficient datasets pose many challenges related to their processing and analysis. These languages are referred to as low resource languages. In this research, we mainly focused on low resource languages, particularly Arabic and its dialects. Dialectal Arabic can be treated as non-standard text that is used in Arab social media and needs to be translated to its standard form. In this context, the importance and the focus of machine translation have increased recently. Unlike English and other languages, translation of Arabic and its dialects has not been thoroughly investigated: existing attempts were mostly based on statistical and rule-based approaches, while neural network approaches have hardly been considered. Therefore, a distribution representation model (embedding model) has been proposed to translate dialectal Arabic to Modern Standard Arabic. As Arabic is a morphologically rich language that has different forms of the same words, the proposed model can help to capture more linguistic features, such as semantic and syntactic features, without any rules. Another benefit of the proposed model is that it can be trained on monolingual datasets instead of parallel datasets. This model was used to translate Egyptian dialect text to Modern Standard Arabic. We also built monolingual datasets from available resources and a small parallel dictionary.
Different datasets were used to evaluate the performance of the proposed method. This research provides new insight into dialectal Arabic translation. Recently, there has been increased interest in Neural Machine Translation (NMT). NMT is a deep learning based model that is trained using large parallel datasets with the aim of mapping text from the source language to the target language. While it shows a promising result for high resource translation languages, such as English, low resource languages face challenges using NMT. Therefore, a number of NMT based models have been developed to translate low resource languages, for instance pre-trained models that utilize monolingual datasets. While these models were used on word level and using recurrent neural networks, which have some limitations, we proposed a hybrid model that combines recurrent and convolutional neural networks on character level to translate low resource languages

OPUS - University of Technology Sydney

Pattern Matching for Translating Domain-Specific Terms from Large Corpora

Author: Fung Pascale
Publication venue: Columbia University Libraries/Information Services
Publication date: 01/01/1995

Translating domain-specific terms is one significant component of machine translation and machine-aided translation systems. These terms are often not found in standard dictionaries. Human translators, not being experts in every technical or regional domain, cannot produce their translations effectively. Automatic translation of domain-specific terms is therefore highly desirable. Most other work on automatic term translation uses statistical information of words from parallel corpora. Parallel corpora of clean translated texts are hard to come by, whereas there are more noisy translated texts and many more monolingual texts in various domains. We propose using noisy parallel texts and same-domain texts of a pair of languages to translate terms. In our work, we propose using a novel paradigm of pattern matching of statistical signals of word features. These features are robust to the syntactic structure, character sets, language of the text, and to the domain. We obtain statistical information which is related to the lexical properties of a word and its translation in any other language of the same domain. These lexical properties are extracted from the corpora and represented in vector form. We propose using signal processing techniques for matching these feature vectors of a word to those of its translation. Another matching technique we propose is applying discriminative analysis of the word features. For each word, the various features are combined into a single vector which is then transformed into a smaller dimension eigenvector for matching. Since most domain-specific terms are nouns and noun phrases, we concentrate on translating English nouns and noun phrases into other languages. We study the relationship between English noun phrases and their translations in Chinese, Japanese and French in parallel corpora.
The result of this study is used in our system for translation of English noun phrases into these other languages from noisy parallel and non-parallel corpora

Columbia University Academic Commons

Improving Statistical Machine Translation Using Comparable Corpora

Author: Snover Matthew Garvey
Publication date: 01/01/2010

With thousands of languages in the world, and the increasing speed and quantity of information being distributed across the world, automatic translation between languages by computers, Machine Translation (MT), has become an increasingly important area of research. State-of-the-art MT systems rely not upon hand-crafted translation rules written by human experts, but rather on learned statistical models that translate a source language to a target language. These models are typically generated from large, parallel corpora containing copies of text in both the source and target languages. The co-occurrence of words across languages in parallel corpora allows the creation of translation rules that specify the probability of translating words or phrases from one language to the other. Monolingual corpora, containing text only in one language--primarily the target language--are not used to model the translation process, but are used to better model the structure of the target language. Unlike parallel data, which require expensive human translators to generate, monolingual data are cheap and widely available. Similar topics and events to those in a source document that is being translated often occur in documents in a comparable monolingual corpus. In much the same way that a human translator would use world knowledge to aid translation, the MT system may be able to use these relevant documents from comparable corpora to guide translation by biasing the translation system to produce output more similar to the relevant documents. This thesis seeks to answer the following questions: (1) Is it possible to improve a modern, state-of-the-art translation system by biasing the MT output to be more similar to relevant passages from comparable monolingual text? (2) What level of similarity is necessary to exploit these techniques? (3) What is the nature of the relevant passages that are needed during the application of these techniques? 
To answer these questions, this thesis describes a method for generating new translation rules from monolingual data specifically targeted for the document that is being translated. Rule generation leverages the existing translation system and topical overlap between the foreign source text and the monolingual text, and unlike regular translation rule generation does not require parallel text. For each source document to be translated, potentially comparable documents are selected from the monolingual data using cross-lingual information retrieval. By biasing the MT system towards the selected relevant documents and then measuring the similarity of the biased output to the relevant documents using Translation Edit Rate Plus (TERp), it is possible to identify sub-sentential regions of the source and comparable documents that are possible translations of each other. This process results in the generation of new translation rules, where the source side is taken from the document to be translated and the target side is fluent target language text taken from the monolingual data. The use of these rules results in improvements over a state-of-the-art statistical translation system. These techniques are most effective when there is a high degree of similarity between the source and relevant passages--such as when they report on the same news stories--but some benefit, approximately half, can be achieved when the passages are only historically or topically related. The discovery of the feasibility of improving MT by using comparable passages to bias MT output provides a basis for future investigation on problems of this type. Ultimately, the goal is to provide a framework within which translation rules may be generated without additional parallel corpora, thus allowing researchers to test longstanding hypotheses about machine translation in the face of scarce parallel resources.

Digital Repository at the University of Maryland