Lexical Simplification with Pretrained Encoders
Lexical simplification (LS) aims to replace complex words in a given sentence
with simpler alternatives of equivalent meaning. Recent unsupervised
lexical simplification approaches rely only on the complex word itself,
regardless of the given sentence, to generate candidate substitutions, which
inevitably produces a large number of spurious candidates. We present a
simple LS approach that makes use of Bidirectional Encoder Representations
from Transformers (BERT), which can consider both the given sentence and the
complex word when generating candidate substitutions for the complex word.
Specifically, we mask the complex word in the original sentence and feed the
result into BERT to predict the masked token. The predictions are then used
as candidate substitutions. Despite being entirely unsupervised, experimental
results show that our approach obtains a clear improvement over
baselines that leverage linguistic databases and parallel corpora, outperforming
the state of the art by more than 12 accuracy points on three well-known
benchmarks.
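The masking step described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the whitespace tokenization, and the `toy_predictor` (a stand-in for a real BERT masked-language-model call, with made-up scores) are all assumptions introduced here.

```python
from typing import Callable, List, Tuple

MASK_TOKEN = "[MASK]"  # mask token used by BERT-style tokenizers

# Hypothetical predictor type: masked sentence -> (word, score) pairs.
Predictor = Callable[[str], List[Tuple[str, float]]]


def mask_complex_word(sentence: str, complex_word: str,
                      mask_token: str = MASK_TOKEN) -> str:
    """Replace the first occurrence of the complex word with the mask token.

    Assumes whitespace tokenization, which is a simplification.
    """
    tokens = sentence.split()
    if complex_word not in tokens:
        raise ValueError(f"{complex_word!r} not found in sentence")
    tokens[tokens.index(complex_word)] = mask_token
    return " ".join(tokens)


def generate_candidates(sentence: str, complex_word: str,
                        predict_masked: Predictor, top_k: int = 5) -> List[str]:
    """Return up to top_k substitution candidates, best score first."""
    masked = mask_complex_word(sentence, complex_word)
    ranked = sorted(predict_masked(masked), key=lambda pair: pair[1], reverse=True)
    candidates: List[str] = []
    for word, _score in ranked:
        # Skip the complex word itself and any repeated predictions.
        if word.lower() != complex_word.lower() and word not in candidates:
            candidates.append(word)
    return candidates[:top_k]


def toy_predictor(masked_sentence: str) -> List[Tuple[str, float]]:
    """Stand-in for a masked language model; the scores are invented."""
    return [("sat", 0.40), ("perched", 0.30), ("rested", 0.20), ("lay", 0.10)]


sentence = "the cat perched on the mat"
print(mask_complex_word(sentence, "perched"))
print(generate_candidates(sentence, "perched", toy_predictor, top_k=3))
```

In practice the toy predictor would be swapped for a real masked language model; the candidate filtering shown here is only one plausible choice.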
Semi-Supervised Text Simplification with Back-Translation and Asymmetric Denoising Autoencoders
Text simplification (TS) rephrases long sentences into simplified variants
while preserving their inherent semantics. Traditional sequence-to-sequence models
rely heavily on the quantity and quality of parallel sentences, which limits
their applicability across languages and domains. This work investigates
how to leverage large amounts of unpaired corpora in the TS task. We adopt the
back-translation architecture from unsupervised neural machine translation (NMT),
including denoising autoencoders for language modeling and automatic generation
of parallel data by iterative back-translation. However, it is non-trivial to
generate appropriate complex-simple pairs if we directly treat the sets of simple
and complex corpora as two different languages, since the two types of
sentences are quite similar and it is hard for the model to capture the
characteristics of each type of sentence. To tackle this problem, we
propose asymmetric denoising methods for sentences of different complexity.
When modeling simple and complex sentences with autoencoders, we introduce
different types of noise into the training process. This method
significantly improves simplification performance. Our model can be trained
in either an unsupervised or a semi-supervised manner. Automatic and human
evaluations show that our unsupervised model outperforms previous systems,
and that with limited supervision, our model performs competitively with multiple
state-of-the-art simplification systems.
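The idea of applying different noise to simple and complex sentences before autoencoding might look something like the sketch below. The abstract does not say which noise types are used on which side, so the pairing here (word dropout for complex sentences, word insertion for simple ones, local shuffling for both), the probabilities, and the filler vocabulary are all assumptions made purely for illustration.

```python
import random
from typing import List


def drop_words(tokens: List[str], drop_prob: float, rng: random.Random) -> List[str]:
    """Randomly delete tokens, always keeping at least one."""
    kept = [t for t in tokens if rng.random() >= drop_prob]
    return kept if kept else tokens[:1]


def add_words(tokens: List[str], filler_vocab: List[str],
              add_prob: float, rng: random.Random) -> List[str]:
    """Randomly insert filler words in front of existing tokens."""
    noisy: List[str] = []
    for token in tokens:
        if rng.random() < add_prob:
            noisy.append(rng.choice(filler_vocab))
        noisy.append(token)
    return noisy


def shuffle_locally(tokens: List[str], window: float, rng: random.Random) -> List[str]:
    """Slightly permute tokens by sorting on position plus bounded random jitter."""
    keys = [i + rng.uniform(0, window) for i in range(len(tokens))]
    return [t for _, t in sorted(zip(keys, tokens), key=lambda pair: pair[0])]


def corrupt(tokens: List[str], is_complex: bool, rng: random.Random) -> List[str]:
    """Apply a complexity-dependent noise recipe (hypothetical pairing)."""
    if is_complex:
        noisy = drop_words(tokens, drop_prob=0.2, rng=rng)
    else:
        noisy = add_words(tokens, ["very", "indeed", "however"], add_prob=0.2, rng=rng)
    return shuffle_locally(noisy, window=3, rng=rng)


rng = random.Random(0)
sentence = "the committee postponed the decision until further notice".split()
print(corrupt(sentence, is_complex=True, rng=rng))
print(corrupt(sentence, is_complex=False, rng=rng))
```

A denoising autoencoder would then be trained to reconstruct the clean sentence from each corrupted version, so the two sides learn from differently shaped corruption.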
Collecting and Exploring Everyday Language for Predicting Psycholinguistic Properties of Words
Conference paper: Collecting and Exploring Everyday Language for Predicting Psycholinguistic Properties of Words
Text Simplification without Simplified Corpora
Tokyo Metropolitan University, 2018-03-25, Doctor of Engineering, 首都大学東
An Automatic Modern Standard Arabic Text Simplification System: A Corpus-Based Approach
This thesis brings together an overview of Text Readability (TR) and Text Simplification (TS) with an application of both to Modern Standard Arabic (MSA). It presents our findings on using automatic TR and TS tools to teach MSA, along with challenges, limitations, and recommendations for enhancing the TR and TS models.
Reading is one of the most vital tasks that provide language input for communication and comprehension skills. It has been shown that long sentences, connected sentences, embedded phrases, passive voice, non-standard word orders, and infrequent words can increase text difficulty for people with low literacy levels, as well as for second language learners. The thesis compares the use of sentence embeddings of different types (fastText, mBERT, XLM-R and Arabic-BERT), as well as traditional language features such as POS tags, dependency trees, readability scores and frequency lists for language learners. The 3-way CEFR (Common European Framework of Reference for Languages proficiency levels) classification reaches an F-1 of 0.80 with Arabic-BERT and 0.75 with XLM-R, and the regression task reaches a Spearman correlation of 0.71. The binary difficulty classifier reaches an F-1 of 0.94, and the sentence-pair semantic similarity classifier reaches an F-1 of 0.98.
TS is an NLP task aiming to reduce the linguistic complexity of a text while maintaining its meaning and original information (Siddharthan, 2002; Camacho Collados, 2013; Saggion, 2017). The simplification study experimented with two approaches: (i) a classification approach and (ii) a generative approach. It then evaluated the effectiveness of these methods using the BERTScore (Zhang et al., 2020) evaluation metric. The simple sentences produced by the mT5 model achieved P 0.72, R 0.68 and F-1 0.70 via BERTScore, while combining Arabic-BERT and fastText achieved P 0.97, R 0.97 and F-1 0.97.
To reiterate, this research demonstrated the effectiveness of a corpus-based method combined with extracting extensive linguistic features via the latest NLP techniques. It provided insights that can be of use in various Arabic corpus studies and NLP tasks such as translation for educational purposes.
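As a quick sanity check on the BERTScore figures quoted in this abstract, F-1 is the harmonic mean of precision and recall. The short sketch below recomputes it from the reported P and R values; the function name is introduced here for illustration. Note that BERTScore is normally computed per sentence pair and then averaged, so a corpus-level F-1 need not equal the harmonic mean of the averaged P and R exactly; here the numbers happen to agree to two decimals.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values as reported in the abstract.
mt5_f1 = f1_from_precision_recall(0.72, 0.68)
combined_f1 = f1_from_precision_recall(0.97, 0.97)

print(round(mt5_f1, 2), round(combined_f1, 2))
```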
Natural Language Processing for Technology Foresight Summarization and Simplification: the case of patents
Technology foresight aims to anticipate possible developments, understand trends, and identify technologies of high impact. To this end, monitoring emerging technologies is crucial. Patents -- the legal documents that protect novel inventions -- can be a valuable source for technology monitoring.
Millions of patent applications are filed yearly, with 3.4 million applications in 2021 alone. Patent documents are primarily textual and disclose innovative and potentially valuable inventions. However, their processing is currently under-researched. This is due to several reasons, including high document complexity: patents are very lengthy and are written in an extremely hard-to-read language that mixes technical and legal jargon.
This thesis explores how Natural Language Processing -- the discipline that enables machines to process human language automatically -- can aid patent processing. Specifically, we focus on two tasks: patent summarization (i.e., we try to reduce the document length while preserving its core content) and patent simplification (i.e., we try to reduce the document's linguistic complexity while preserving its original core meaning).
We found that older patent summarization approaches were not compared on shared benchmarks (thus making it hard to draw conclusions), and even the most recent abstractive dataset presents important issues that might make comparisons meaningless.
We try to fill both gaps: we first document the issues related to the BigPatent dataset and then benchmark extractive, abstractive, and hybrid approaches in the patent domain.
We also explore transferring summarization methods from the scientific paper domain, with limited success.
For the automatic text simplification task, we noticed a lack of simplified text and parallel corpora. We fill this gap by defining a method to generate a silver standard for patent simplification automatically. Lay human judges evaluated the simplified sentences in the corpus as grammatical, adequate, and simpler, and we show that it can be used to train a state-of-the-art simplification model.
This thesis describes the first steps toward Natural Language Processing-aided patent summarization and simplification. We hope it will encourage more research on the topic, opening doors for a productive dialog between NLP researchers and domain experts.