6 research outputs found

    Evaluating prose style transfer with the Bible

    Get PDF
    In the prose style transfer task a system, provided with text input and a target prose style, produces output which preserves the meaning of the input text but alters the style. These systems require parallel data for evaluation of results and usually make use of parallel data for training. Currently, there are few publicly available corpora for this task. In this work, we identify a high-quality source of aligned, stylistically distinct text in different versions of the Bible. We provide a standardized split, into training, development and testing data, of the public domain versions in our corpus. This corpus is highly parallel since many Bible versions are included. Sentences are aligned due to the presence of chapter and verse numbers within all versions of the text. In addition to the corpus, we present the results, as measured by the BLEU and PINC metrics, of several models trained on our data which can serve as baselines for future research. While we present these data as a style transfer corpus, we believe that it is of unmatched quality and may be useful for other natural language tasks as well

    Text Style Transfer: A Review and Experimental Evaluation

    Full text link
    The stylistic properties of text have intrigued computational linguistics researchers in recent years. Specifically, researchers have investigated the Text Style Transfer (TST) task, which aims to change the stylistic properties of the text while retaining its style independent content. Over the last few years, many novel TST algorithms have been developed, while the industry has leveraged these algorithms to enable exciting TST applications. The field of TST research has burgeoned because of this symbiosis. This article aims to provide a comprehensive review of recent research efforts on text style transfer. More concretely, we create a taxonomy to organize the TST models and provide a comprehensive summary of the state of the art. We review the existing evaluation methodologies for TST tasks and conduct a large-scale reproducibility study where we experimentally benchmark 19 state-of-the-art TST algorithms on two publicly available datasets. Finally, we expand on current trends and provide new perspectives on the new and exciting developments in the TST field

    Understanding and Enhancing the Use of Context for Machine Translation

    Get PDF
    To understand and infer meaning in language, neural models have to learn complicated nuances. Discovering distinctive linguistic phenomena from data is not an easy task. For instance, lexical ambiguity is a fundamental feature of language which is challenging to learn. Even more prominently, inferring the meaning of rare and unseen lexical units is difficult with neural networks. Meaning is often determined from context. With context, languages allow meaning to be conveyed even when the specific words used are not known by the reader. To model this learning process, a system has to learn from a few instances in context and be able to generalize well to unseen cases. The learning process is hindered when training data is scarce for a task. Even with sufficient data, learning patterns for the long tail of the lexical distribution is challenging. In this thesis, we focus on understanding certain potentials of contexts in neural models and design augmentation models to benefit from them. We focus on machine translation as an important instance of the more general language understanding problem. To translate from a source language to a target language, a neural model has to understand the meaning of constituents in the provided context and generate constituents with the same meanings in the target language. This task accentuates the value of capturing nuances of language and the necessity of generalization from few observations. The main problem we study in this thesis is what neural machine translation models learn from data and how we can devise more focused contexts to enhance this learning. Looking more in-depth into the role of context and the impact of data on learning models is essential to advance the NLP field. Moreover, it helps highlight the vulnerabilities of current neural networks and provides insights into designing more robust models.Comment: PhD dissertation defended on November 10th, 202

    非負値行列因子分解アルゴリズムに基づくメッセージ特徴の選択手法に関する研究

    Get PDF
    筑波大学 (University of Tsukuba)201

    An Automatic Modern Standard Arabic Text Simplification System: A Corpus-Based Approach

    Get PDF
    This thesis brings together an overview of Text Readability (TR) about Text Simplification (TS) with an application of both to Modern Standard Arabic (MSA). It will present our findings on using automatic TR and TS tools to teach MSA, along with challenges, limitations, and recommendations about enhancing the TR and TS models. Reading is one of the most vital tasks that provide language input for communication and comprehension skills. It is proved that the use of long sentences, connected sentences, embedded phrases, passive voices, non- standard word orders, and infrequent words can increase the text difficulty for people with low literacy levels, as well as second language learners. The thesis compares the use of sentence embeddings of different types (fastText, mBERT, XLM-R and Arabic-BERT), as well as traditional language features such as POS tags, dependency trees, readability scores and frequency lists for language learners. The accuracy of the 3-way CEFR (The Common European Framework of Reference for Languages Proficiency Levels) classification is F-1 of 0.80 and 0.75 for Arabic-Bert and XLM-R classification, respectively and 0.71 Spearman correlation for the regression task. At the same time, the binary difficulty classifier reaches F-1 0.94 and F-1 0.98 for the sentence-pair semantic similarity classifier. TS is an NLP task aiming to reduce the linguistic complexity of the text while maintaining its meaning and original information (Siddharthan, 2002; Camacho Collados, 2013; Saggion, 2017). The simplification study experimented using two approaches: (i) a classification approach and (ii) a generative approach. It then evaluated the effectiveness of these methods using the BERTScore (Zhang et al., 2020) evaluation metric. The simple sentences produced by the mT5 model achieved P 0.72, R 0.68 and F-1 0.70 via BERTScore while combining Arabic- BERT and fastText achieved P 0.97, R 0.97 and F-1 0.97. To reiterate, this research demonstrated the effectiveness of the implementation of a corpus-based method combined with extracting extensive linguistic features via the latest NLP techniques. It provided insights which can be of use in various Arabic corpus studies and NLP tasks such as translation for educational purposes
    corecore