Multilingual Unsupervised Sentence Simplification
Progress in Sentence Simplification has been hindered by the lack of
supervised data, particularly in languages other than English. Previous work
has aligned sentences from original and simplified corpora such as English
Wikipedia and Simple English Wikipedia, but this limits corpus size, domain,
and language. In this work, we propose using unsupervised mining techniques to
automatically create training corpora for simplification in multiple languages
from raw Common Crawl web data. When coupled with a controllable generation
mechanism that can flexibly adjust attributes such as length and lexical
complexity, these mined paraphrase corpora can be used to train simplification
systems in any language. We further incorporate multilingual unsupervised
pretraining methods to create even stronger models and show that by training on
mined data rather than supervised corpora, we outperform the previous best
results. We evaluate our approach on English, French, and Spanish
simplification benchmarks and reach state-of-the-art performance with a totally
unsupervised approach. We will release our models and code to mine the data in
any language included in Common Crawl.
Comparison and Adaptation of Automatic Evaluation Metrics for Quality Assessment of Re-Speaking
Re-speaking is a mechanism for obtaining high quality subtitles for use in
live broadcast and other public events. Because it relies on humans performing
the actual re-speaking, the task of estimating the quality of the results is
non-trivial. Most organisations rely on humans to perform the actual quality
assessment, but purely automatic methods have been developed for other similar
problems, such as Machine Translation. This paper compares several of
these methods: BLEU, EBLEU, NIST, METEOR, METEOR-PL, TER and RIBES. These are
then matched against the human-derived NER metric commonly used in re-speaking.
Comment: Comparison and Adaptation of Automatic Evaluation Metrics for Quality
Assessment of Re-Speaking. arXiv admin note: text overlap with
arXiv:1509.0908
Exploiting Lexical Conceptual Structure for paraphrase generation
Lexical Conceptual Structure (LCS) represents verbs as semantic structures with a limited number of semantic predicates. This paper explores how LCS can be used to explain the regularities underlying lexical and syntactic paraphrases, such as verb alternation, compound word decomposition, and lexical derivation. We propose a paraphrase generation model that transforms the LCSs of verbs, and conduct an empirical experiment taking the paraphrasing of Japanese light-verb constructions as an example. The experimental results indicate that the syntactic and semantic properties of verbs encoded in LCS are useful for semantically constraining syntactic transformation in paraphrase generation.
Three English Learner Assistance Systems Using Automatic Paraphrasing Techniques
We developed three systems based on automatic paraphrasing techniques to help English learners and English-language beginners. The first extracts personal error patterns in the user’s English usage. The second transforms English sentences containing the letters “l” and “r” into sentences containing fewer instances of these letters, which Japanese people have trouble pronouncing properly in English. This system could be used, for example, to transform a draft of a presentation that a Japanese speaker is to give to an audience. The third is an annotation system that provides definition sentences for difficult English words, making them easier to understand. We believe that these systems will be useful both for learners of English and in studies on second-language acquisition.
The Circle of Meaning: From Translation to Paraphrasing and Back
The preservation of meaning between inputs and outputs is perhaps
the most ambitious and, often, the most elusive goal of systems
that attempt to process natural language. Nowhere is this goal of
more obvious importance than for the tasks of machine translation
and paraphrase generation. Preserving meaning between the input and
the output is paramount for both, the monolingual vs bilingual distinction
notwithstanding. In this thesis, I present a novel, symbiotic relationship
between these two tasks that I term the "circle of meaning".
Today's statistical machine translation (SMT) systems require high
quality human translations for parameter tuning, in addition to
large bi-texts for learning the translation units. This parameter
tuning usually involves generating translations at different points
in the parameter space and obtaining feedback against human-authored
reference translations as to how good the translations are. This feedback
then dictates what point in the parameter space should be explored
next. To measure this feedback, it is generally considered wise to have
multiple (usually 4) reference translations to avoid unfair penalization of translation
hypotheses which could easily happen given the large number of ways in which
a sentence can be translated from one language to another. However, this reliance on multiple reference translations
creates a problem since they are labor intensive and expensive to obtain.
Therefore, most current MT datasets only contain a single reference.
This leads to the problem of reference sparsity (the primary open problem
that I address in this dissertation), one that has a serious effect on the
SMT parameter tuning process.
Bannard and Callison-Burch (2005) were the first to provide a practical
connection between phrase-based statistical machine translation and paraphrase
generation. However, their technique is restricted to generating phrasal
paraphrases. I build upon their approach and augment a phrasal paraphrase
extractor into a sentential paraphraser with extremely broad coverage.
The novelty in this augmentation lies in the further strengthening of
the connection between statistical machine translation and paraphrase
generation; whereas Bannard and Callison-Burch only relied on SMT machinery
to extract phrasal paraphrase rules and stopped there, I take it a few
steps further and build a full English-to-English SMT system. This system
can, as expected, "translate" any English input sentence into a new English
sentence with the same degree of meaning preservation that exists in a bilingual
SMT system. In fact, being a state-of-the-art SMT system, it is able to generate
n-best "translations" for any given input sentence. This sentential
paraphraser, built almost entirely from existing SMT machinery, represents
the first 180 degrees of the circle of meaning.
To complete the circle, I describe a novel connection in the other direction.
I claim that the sentential paraphraser, once built in this fashion, can
provide a solution to the reference sparsity problem and, hence, be used
to improve the performance of a bilingual SMT system. I discuss two different
instantiations of the sentential paraphraser and show several results that
provide empirical validation for this connection.
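The phrasal pivoting idea of Bannard and Callison-Burch (2005) that this thesis builds on can be illustrated with a minimal sketch: the probability of paraphrasing an English phrase e1 as e2 is obtained by summing over the foreign phrases f that both align to, p(e2|e1) = sum over f of p(e2|f) * p(f|e1). The phrase tables and the function name below are toy assumptions made up for illustration; they are not taken from the thesis, and a real system would estimate these probabilities from a large bi-text.

```python
from collections import defaultdict

# Toy phrase tables (hypothetical values, for illustration only).
# p_f_given_e[e][f] = p(f | e); p_e_given_f[f][e] = p(e | f)
p_f_given_e = {
    "under control": {"unter kontrolle": 1.0},
}
p_e_given_f = {
    "unter kontrolle": {"under control": 0.75, "in check": 0.25},
}


def pivot_paraphrases(e1, p_f_given_e, p_e_given_f):
    """Score paraphrases of e1 by marginalizing over pivot phrases f."""
    scores = defaultdict(float)
    for f, p_f in p_f_given_e.get(e1, {}).items():
        for e2, p_e2 in p_e_given_f.get(f, {}).items():
            if e2 != e1:  # skip the trivial paraphrase e1 -> e1
                scores[e2] += p_e2 * p_f
    return dict(scores)


paraphrases = pivot_paraphrases("under control", p_f_given_e, p_e_given_f)
```

This sketch covers only phrasal paraphrase scoring; the sentential paraphraser described above goes further by using such rules inside a full English-to-English SMT system.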
Detecting Machine-Translated Text using Back Translation
Machine-translated text plays a crucial role in the communication of people
using different languages. However, adversaries can use such text for malicious
purposes such as plagiarism and fake reviews. Existing methods detect
machine-translated text using only the text's intrinsic content, but they are
unsuitable for distinguishing machine-translated from human-written texts with
the same meaning. We propose a method that extracts features for
distinguishing machine from human text based on the similarity between the
original text and its back-translation. The evaluation on detecting translated
sentences with French shows that our method achieves 75.0% in both accuracy and
F-score. It outperforms the existing methods, whose best accuracy is 62.8% and
best F-score is 62.7%. The proposed method detects back-translated text even
more effectively, with 83.4% accuracy, higher than the best previous accuracy
of 66.7%. We achieve similar results for F-score and in corresponding
experiments with Japanese. Moreover, we show that our detector can recognize
both machine-translated and machine-back-translated texts without knowing the
language that was used to generate them. This demonstrates the persistence of
our method in various applications in both low- and rich-resource languages.
Comment: INLG 2019, 9 page
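The core feature described in this abstract, a similarity score between a text and its back-translation, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the two translation callables are hypothetical placeholders for any MT system, and the word-level `difflib.SequenceMatcher` ratio is a stand-in for the similarity features the paper actually uses, which the abstract does not specify.

```python
from difflib import SequenceMatcher
from typing import Callable


def back_translation_similarity(
    text: str,
    to_pivot: Callable[[str], str],    # hypothetical: translate into a pivot language
    from_pivot: Callable[[str], str],  # hypothetical: translate back to the source language
) -> float:
    """Return a word-level similarity in [0, 1] between text and its back-translation."""
    back_translated = from_pivot(to_pivot(text))
    # Compare token sequences rather than characters (an assumption of this sketch).
    matcher = SequenceMatcher(None, text.split(), back_translated.split())
    return matcher.ratio()
```

The resulting score would then serve as one input feature to a classifier trained on labelled machine and human text; the classifier itself is outside the scope of this sketch.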