Noisy-parallel and comparable corpora filtering methodology for the extraction of bi-lingual equivalent data at sentence level
Text alignment and text quality are critical to the accuracy of Machine
Translation (MT) systems, some NLP tools, and any other text processing tasks
requiring bilingual data. This research proposes a language-independent
bi-sentence filtering approach based on Polish-to-English experiments (Polish
is not a position-sensitive language). This cleaning approach was developed on the
TED Talks corpus and also initially tested on the Wikipedia comparable corpus,
but it can be used for any text domain or language pair. The proposed approach
implements various heuristics for sentence comparison. Some of them leverage
synonyms and semantic and structural analysis of text as additional
information. Minimization of data loss was ensured. An improvement in MT system
score with text processed using the tool is discussed.
Comment: arXiv admin note: text overlap with arXiv:1509.09093,
arXiv:1509.0888
Transfer Learning for Speech and Language Processing
Transfer learning is a vital technique that generalizes models trained for
one setting or task to other settings or tasks. For example, in speech
recognition, an acoustic model trained for one language can be used to
recognize speech in another language, with little or no re-training data.
Transfer learning is closely related to multi-task learning (cross-lingual vs.
multilingual), and is traditionally studied under the name of 'model adaptation'.
Recent advances in deep learning show that transfer learning becomes much
easier and more effective with high-level abstract features learned by deep
models, and the 'transfer' can be conducted not only between data distributions
and data types, but also between model structures (e.g., shallow nets and deep
nets) or even model types (e.g., Bayesian models and neural models). This
review paper summarizes some recent prominent research in this direction,
particularly for speech and language processing. We also report some results
from our group and highlight the potential of this very interesting research
field.
Comment: 13 pages, APSIPA 201
Translationese and post-editese : how comparable is comparable quality?
Whereas post-edited texts have been shown to be either of comparable quality to human translations or better, one study shows that people still seem to prefer human-translated texts. The idea of texts being inherently different despite being of high quality is not new. Translated texts, for example, are also different from original texts, a phenomenon referred to as ‘Translationese’. Research into Translationese has shown that, whereas humans cannot distinguish between translated and original text, computers have been trained to detect Translationese successfully. It remains to be seen whether the same can be done for what we call Post-editese. We first establish whether humans are capable of distinguishing post-edited texts from human translations, and then establish whether it is possible to build a supervised machine-learning model that can distinguish between translated and post-edited text.
Bridging the Domain Gap for Stance Detection for the Zulu language
Misinformation has become a major concern in recent years given its
spread across our information sources. In the past years, many NLP tasks have
been introduced in this area, with some systems reaching good results on
English language datasets. Existing AI-based approaches for fighting
misinformation in the literature suggest automatic stance detection as an integral
first step to success. Our paper aims at utilizing this progress made for
English to transfer that knowledge into other languages, which is a
non-trivial task due to the domain gap between English and the target
languages. We propose a black-box non-intrusive method that utilizes techniques
from Domain Adaptation to reduce the domain gap, without requiring any human
expertise in the target language, by leveraging low-quality data in both a
supervised and unsupervised manner. This allows us to rapidly achieve similar
results for stance detection for the Zulu language, the target language in this
work, as are found for English. We also provide a stance detection dataset in
the Zulu language. Our experimental results show that by leveraging English
datasets and machine translation we can increase performance on both English
data and other languages.
Comment: accepted to Intellisy