12 research outputs found

    Lexical Simplification with Pretrained Encoders

    Full text link
    Lexical simplification (LS) aims to replace complex words in a given sentence with their simpler alternatives of equivalent meaning. Recently unsupervised lexical simplification approaches only rely on the complex word itself regardless of the given sentence to generate candidate substitutions, which will inevitably produce a large number of spurious candidates. We present a simple LS approach that makes use of the Bidirectional Encoder Representations from Transformers (BERT) which can consider both the given sentence and the complex word during generating candidate substitutions for the complex word. Specifically, we mask the complex word of the original sentence for feeding into the BERT to predict the masked token. The predicted results will be used as candidate substitutions. Despite being entirely unsupervised, experimental results show that our approach obtains obvious improvement compared with these baselines leveraging linguistic databases and parallel corpus, outperforming the state-of-the-art by more than 12 Accuracy points on three well-known benchmarks

    Semi-Supervised Text Simplification with Back-Translation and Asymmetric Denoising Autoencoders

    Full text link
    Text simplification (TS) rephrases long sentences into simplified variants while preserving inherent semantics. Traditional sequence-to-sequence models heavily rely on the quantity and quality of parallel sentences, which limits their applicability in different languages and domains. This work investigates how to leverage large amounts of unpaired corpora in TS task. We adopt the back-translation architecture in unsupervised machine translation (NMT), including denoising autoencoders for language modeling and automatic generation of parallel data by iterative back-translation. However, it is non-trivial to generate appropriate complex-simple pair if we directly treat the set of simple and complex corpora as two different languages, since the two types of sentences are quite similar and it is hard for the model to capture the characteristics in different types of sentences. To tackle this problem, we propose asymmetric denoising methods for sentences with separate complexity. When modeling simple and complex sentences with autoencoders, we introduce different types of noise into the training process. Such a method can significantly improve the simplification performance. Our model can be trained in both unsupervised and semi-supervised manner. Automatic and human evaluations show that our unsupervised model outperforms the previous systems, and with limited supervision, our model can perform competitively with multiple state-of-the-art simplification systems

    Collecting and Exploring Everyday Language for Predicting Psycholinguistic Properties of Words

    Get PDF
    Conference paper: Collecting and Exploring Everyday Language for Predicting Psycholinguistic Properties of Word

    平易なコーパスを用いないテキスト平易化

    Get PDF
    首都大学東京, 2018-03-25, 博士(工学)首都大学東

    An Automatic Modern Standard Arabic Text Simplification System: A Corpus-Based Approach

    Get PDF
    This thesis brings together an overview of Text Readability (TR) about Text Simplification (TS) with an application of both to Modern Standard Arabic (MSA). It will present our findings on using automatic TR and TS tools to teach MSA, along with challenges, limitations, and recommendations about enhancing the TR and TS models. Reading is one of the most vital tasks that provide language input for communication and comprehension skills. It is proved that the use of long sentences, connected sentences, embedded phrases, passive voices, non- standard word orders, and infrequent words can increase the text difficulty for people with low literacy levels, as well as second language learners. The thesis compares the use of sentence embeddings of different types (fastText, mBERT, XLM-R and Arabic-BERT), as well as traditional language features such as POS tags, dependency trees, readability scores and frequency lists for language learners. The accuracy of the 3-way CEFR (The Common European Framework of Reference for Languages Proficiency Levels) classification is F-1 of 0.80 and 0.75 for Arabic-Bert and XLM-R classification, respectively and 0.71 Spearman correlation for the regression task. At the same time, the binary difficulty classifier reaches F-1 0.94 and F-1 0.98 for the sentence-pair semantic similarity classifier. TS is an NLP task aiming to reduce the linguistic complexity of the text while maintaining its meaning and original information (Siddharthan, 2002; Camacho Collados, 2013; Saggion, 2017). The simplification study experimented using two approaches: (i) a classification approach and (ii) a generative approach. It then evaluated the effectiveness of these methods using the BERTScore (Zhang et al., 2020) evaluation metric. The simple sentences produced by the mT5 model achieved P 0.72, R 0.68 and F-1 0.70 via BERTScore while combining Arabic- BERT and fastText achieved P 0.97, R 0.97 and F-1 0.97. To reiterate, this research demonstrated the effectiveness of the implementation of a corpus-based method combined with extracting extensive linguistic features via the latest NLP techniques. It provided insights which can be of use in various Arabic corpus studies and NLP tasks such as translation for educational purposes

    Natural Language Processing for Technology Foresight Summarization and Simplification: the case of patents

    Get PDF
    Technology foresight aims to anticipate possible developments, understand trends, and identify technologies of high impact. To this end, monitoring emerging technologies is crucial. Patents -- the legal documents that protect novel inventions -- can be a valuable source for technology monitoring. Millions of patent applications are filed yearly, with 3.4 million applications in 2021 only. Patent documents are primarily textual documents and disclose innovative and potentially valuable inventions. However, their processing is currently underresearched. This is due to several reasons, including the high document complexity: patents are very lengthy and are written in an extremely hard-to-read language, which is a mix of technical and legal jargon. This thesis explores how Natural Language Processing -- the discipline that enables machines to process human language automatically -- can aid patent processing. Specifically, we focus on two tasks: patent summarization (i.e., we try to reduce the document length while preserving its core content) and patent simplification (i.e., we try to reduce the document's linguistic complexity while preserving its original core meaning). We found that older patent summarization approaches were not compared on shared benchmarks (making thus it hard to draw conclusions), and even the most recent abstractive dataset presents important issues that might make comparisons meaningless. We try to fill both gaps: we first document the issues related to the BigPatent dataset and then benchmark extractive, abstraction, and hybrid approaches in the patent domain. We also explore transferring summarization methods from the scientific paper domain with limited success. For the automatic text simplification task, we noticed a lack of simplified text and parallel corpora. We fill this gap by defining a method to generate a silver standard for patent simplification automatically. Lay human judges evaluated the simplified sentences in the corpus as grammatical, adequate, and simpler, and we show that it can be used to train a state-of-the-art simplification model. This thesis describes the first steps toward Natural Language Processing-aided patent summarization and simplification. We hope it will encourage more research on the topic, opening doors for a productive dialog between NLP researchers and domain experts.Technology foresight aims to anticipate possible developments, understand trends, and identify technologies of high impact. To this end, monitoring emerging technologies is crucial. Patents -- the legal documents that protect novel inventions -- can be a valuable source for technology monitoring. Millions of patent applications are filed yearly, with 3.4 million applications in 2021 only. Patent documents are primarily textual documents and disclose innovative and potentially valuable inventions. However, their processing is currently underresearched. This is due to several reasons, including the high document complexity: patents are very lengthy and are written in an extremely hard-to-read language, which is a mix of technical and legal jargon. This thesis explores how Natural Language Processing -- the discipline that enables machines to process human language automatically -- can aid patent processing. Specifically, we focus on two tasks: patent summarization (i.e., we try to reduce the document length while preserving its core content) and patent simplification (i.e., we try to reduce the document's linguistic complexity while preserving its original core meaning). We found that older patent summarization approaches were not compared on shared benchmarks (making thus it hard to draw conclusions), and even the most recent abstractive dataset presents important issues that might make comparisons meaningless. We try to fill both gaps: we first document the issues related to the BigPatent dataset and then benchmark extractive, abstraction, and hybrid approaches in the patent domain. We also explore transferring summarization methods from the scientific paper domain with limited success. For the automatic text simplification task, we noticed a lack of simplified text and parallel corpora. We fill this gap by defining a method to generate a silver standard for patent simplification automatically. Lay human judges evaluated the simplified sentences in the corpus as grammatical, adequate, and simpler, and we show that it can be used to train a state-of-the-art simplification model. This thesis describes the first steps toward Natural Language Processing-aided patent summarization and simplification. We hope it will encourage more research on the topic, opening doors for a productive dialog between NLP researchers and domain experts
    corecore