17 research outputs found
A Report on the Complex Word Identification Shared Task 2018
We report the findings of the second Complex Word Identification (CWI) shared
task organized as part of the BEA workshop co-located with NAACL-HLT'2018. The
second CWI shared task featured multilingual and multi-genre datasets divided
into four tracks: English monolingual, German monolingual, Spanish monolingual,
and a multilingual track with a French test set, and two tasks: binary
classification and probabilistic classification. A total of 12 teams submitted
their results in different task/track combinations and 11 of them wrote system
description papers that are referred to in this report and appear in the BEA
workshop proceedings.
Comment: Second CWI Shared Task co-located with the BEA Workshop 2018 at
NAACL-HLT in New Orleans, US.
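The two task settings above differ only in the label derived from the annotators' votes: in the 2018 setup, a binary label typically marks a word complex if any annotator flagged it, while the probabilistic label is the fraction of annotators who did. A minimal sketch (helper names are ours, not from the shared task):

```python
def probabilistic_label(votes):
    """Fraction of annotators who marked the target word complex."""
    return sum(votes) / len(votes)

def binary_label(votes):
    """Complex if at least one annotator flagged the word."""
    return int(any(votes))

votes = [1, 0, 0, 1, 0]  # five annotators, two flagged "complex"
print(probabilistic_label(votes))  # 0.4
print(binary_label(votes))         # 1
```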
Demonstrating PAR4SEM - A Semantic Writing Aid with Adaptive Paraphrasing
In this paper, we present Par4Sem, a semantic writing aid tool based on
adaptive paraphrasing. Unlike many annotation tools that are primarily used to
collect training examples, Par4Sem is integrated into a real-world
application, in this case a writing aid tool, in order to collect training
examples from usage data. Par4Sem supports an adaptive, iterative, and
interactive process where the underlying machine learning models are updated
for each iteration using new training examples from usage data. After
motivating the use of ever-learning tools in NLP applications, we evaluate
Par4Sem by adapting it to a text simplification task through mere usage.
Comment: EMNLP Demo paper.
CompLex: A New Corpus for Lexical Complexity Prediction from Likert Scale Data
Predicting which words are considered hard to understand for a given target
population is a vital step in many NLP applications such as text
simplification. This task is commonly referred to as Complex Word
Identification (CWI). With a few exceptions, previous studies have approached
the task as a binary classification task in which systems predict a complexity
value (complex vs. non-complex) for a set of target words in a text. This
choice is motivated by the fact that all CWI datasets compiled so far have been
annotated using a binary annotation scheme. Our paper addresses this limitation
by presenting the first English dataset for continuous lexical complexity
prediction. We use a 5-point Likert scale scheme to annotate complex words in
texts from three sources/domains: the Bible, Europarl, and biomedical texts.
This resulted in a corpus of 9,476 sentences, each annotated by around 7
annotators.
Comment: Proceedings of the 1st Workshop on Tools and Resources to Empower
People with REAding DIfficulties (READI), pp. 57-6
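A 5-point Likert annotation turns into a continuous prediction target by aggregating the annotators' ratings; below is a minimal sketch of one plausible aggregation (averaging, then rescaling to [0, 1] — the exact mapping is our assumption, not taken from the paper):

```python
def complexity_score(ratings):
    """Map 5-point Likert ratings (1 = very easy ... 5 = very difficult)
    to a continuous complexity score in [0, 1] by averaging and rescaling."""
    mean = sum(ratings) / len(ratings)
    return (mean - 1) / 4

# e.g., seven annotators rating one target word
print(complexity_score([2, 3, 3, 2, 4, 3, 3]))
```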
UnibucKernel: A kernel-based learning method for complex word identification
In this paper, we present a kernel-based learning approach for the 2018
Complex Word Identification (CWI) Shared Task. Our approach is based on
combining multiple low-level features, such as character n-grams, with
high-level semantic features that are either automatically learned using word
embeddings or extracted from a lexical knowledge base, namely WordNet. After
feature extraction, we employ a kernel method for the learning phase. The
feature matrix is first transformed into a normalized kernel matrix. For the
binary classification task (simple versus complex), we employ Support Vector
Machines. For the regression task, in which we have to predict the complexity
level of a word (a word is more complex if it is labeled as complex by more
annotators), we employ ν-Support Vector Regression. We applied our approach
only on the three English data sets containing documents from Wikipedia,
WikiNews and News domains. Our best result during the competition was the third
place on the English Wikipedia data set. However, in this paper, we also report
better post-competition results.
Comment: This paper presents the system developed by the UnibucKernel team for
the 2018 CWI Shared Task. Accepted at the BEA13 Workshop of NAACL 2018.
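The kernel normalization step described above has a standard closed form: each Gram-matrix entry is divided by the geometric mean of the corresponding diagonal entries, so every self-similarity becomes 1. A minimal sketch (the linear kernel and toy vectors are our illustration, not the team's exact feature set):

```python
import math

def linear_kernel(X):
    """Gram matrix of dot products between feature vectors
    (e.g., character n-gram counts)."""
    return [[sum(a * b for a, b in zip(x, z)) for z in X] for x in X]

def normalize_kernel(K):
    """Normalized kernel matrix: K'[i][j] = K[i][j] / sqrt(K[i][i]*K[j][j]),
    so that K'[i][i] == 1 for every sample."""
    n = len(K)
    d = [math.sqrt(K[i][i]) for i in range(n)]
    return [[K[i][j] / (d[i] * d[j]) for j in range(n)] for i in range(n)]

X = [[2, 0, 1], [0, 3, 1], [1, 1, 1]]   # toy feature matrix
K = normalize_kernel(linear_kernel(X))
print(K[0][0])  # 1.0 on the diagonal after normalization
```

The normalized matrix can then be fed to a precomputed-kernel learner, e.g. scikit-learn's `SVC(kernel="precomputed")` for the binary task or `NuSVR(kernel="precomputed")` for the regression task.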
Detecting Multiword Expression Type Helps Lexical Complexity Assessment
Multiword expressions (MWEs) represent lexemes that should be treated as
single lexical units due to their idiosyncratic nature. Multiple NLP
applications have been shown to benefit from MWE identification; however, the
research on lexical complexity of MWEs is still an under-explored area. In this
work, we re-annotate the Complex Word Identification Shared Task 2018 dataset
of Yimam et al. (2017), which provides complexity scores for a range of
lexemes, with the types of MWEs. We release the MWE-annotated dataset with this
paper, and we believe this dataset represents a valuable resource for the text
simplification community. In addition, we investigate which types of
expressions are most problematic for native and non-native readers. Finally, we
show that a lexical complexity assessment system benefits from the information
about MWE types.
Comment: Accepted for publication at LREC 2020.
A Word-Complexity Lexicon and A Neural Readability Ranking Model for Lexical Simplification
Current lexical simplification approaches rely heavily on heuristics and
corpus level features that do not always align with human judgment. We create a
human-rated word-complexity lexicon of 15,000 English words and propose a novel
neural readability ranking model with a Gaussian-based feature vectorization
layer that utilizes these human ratings to measure the complexity of any given
word or phrase. Our model performs better than the state-of-the-art systems for
different lexical simplification tasks and evaluation datasets. Additionally,
we also produce SimplePPDB++, a lexical resource of over 10 million simplifying
paraphrase rules, by applying our model to the Paraphrase Database (PPDB).
Comment: 12 pages; EMNLP 2018.
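The Gaussian-based feature vectorization layer can be pictured as projecting a scalar rating onto a set of Gaussian bumps, giving the network a smooth, localized encoding of the feature. A minimal sketch (the centers and width here are our assumptions for illustration, not the paper's parameters):

```python
import math

def gaussian_vectorize(value, centers, sigma):
    """Expand a scalar feature (e.g., a human complexity rating) into a
    vector of Gaussian activations centered at fixed points, so a model
    can learn smooth, locally-weighted responses to the feature."""
    return [math.exp(-((value - c) ** 2) / (2 * sigma ** 2)) for c in centers]

centers = [0.0, 0.25, 0.5, 0.75, 1.0]   # assumed bin centers over [0, 1]
vec = gaussian_vectorize(0.5, centers, sigma=0.125)
print([round(v, 3) for v in vec])  # activation peaks at the 0.5 center
```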
OCHADAI-KYOTO at SemEval-2021 Task 1: Enhancing Model Generalization and Robustness for Lexical Complexity Prediction
We propose an ensemble model for predicting the lexical complexity of words
and multiword expressions (MWEs). The model receives as input a sentence with a
target word or MWE and outputs its complexity score. Given that a key challenge
with this task is the limited size of annotated data, our model relies on
pretrained contextual representations from different state-of-the-art
transformer-based language models (i.e., BERT and RoBERTa), and on a variety of
training methods for further enhancing model generalization and
robustness: multi-step fine-tuning, multi-task learning, and adversarial
training. Additionally, we propose to enrich contextual representations by
adding hand-crafted features during training. Our model achieved competitive
results and ranked among the top-10 systems in both sub-tasks.
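Two of the ingredients above — concatenating hand-crafted features onto a contextual embedding, and averaging predictions across an ensemble — reduce to very small operations. A sketch with toy values (the real system feeds the enriched vector into a trained regression head):

```python
def enrich(contextual_embedding, handcrafted):
    """Concatenate a transformer's contextual embedding with hand-crafted
    features (e.g., word length, log-frequency) before the regression head."""
    return list(contextual_embedding) + list(handcrafted)

def ensemble_score(model_scores):
    """Average complexity predictions from independently trained models."""
    return sum(model_scores) / len(model_scores)

emb = [0.12, -0.34, 0.56]      # stand-in for a BERT/RoBERTa vector
feats = [0.35, 0.81]           # toy hand-crafted feature values
print(len(enrich(emb, feats)))                  # 5
print(ensemble_score([0.31, 0.27, 0.35]))       # mean of three models
```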
Par4Sim -- Adaptive Paraphrasing for Text Simplification
Learning from a real-world data stream and continuously updating the model
without explicit supervision is a new challenge for NLP applications with
machine learning components. In this work, we have developed an adaptive
learning system for text simplification, which improves the underlying
learning-to-rank model from usage data, i.e., how users have employed the
system for the task of simplification. Our experimental results show that,
over time, the performance of the embedded paraphrase ranking model increases
steadily, improving from 62.88% to 75.70% on the NDCG@10 evaluation metric. To
our knowledge, this is the first study where an
NLP component is adaptively improved through usage.
Comment: COLING 2018 main conference.
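NDCG@10, the metric quoted above, compares the system's ranking of paraphrase candidates against the ideal ordering of the same relevance labels. A minimal sketch (the relevance values are invented for illustration):

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k ranked items."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=10):
    """DCG of the system's ranking divided by the DCG of the ideal ranking."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# paraphrase candidates in model order; relevance = user preference grade
print(ndcg_at_k([3, 2, 3, 0, 1, 2], k=10))
```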
Automatic Compilation of Resources for Academic Writing and Evaluating with Informal Word Identification and Paraphrasing System
We present the first approach to automatically building resources for
academic writing. The aim is to build a writing aid system that automatically
edits a text so that it better adheres to the academic style of writing. On top
of existing academic resources, such as the Corpus of Contemporary American
English (COCA) Academic Word List, the New Academic Word List, and the Academic
Collocation List, we also explore how to dynamically build such resources that
would be used to automatically identify informal or non-academic words or
phrases. The resources are compiled using different generic approaches that can
be extended for different domains and languages. We describe the evaluation of
resources with a system implementation. The system consists of an informal word
identification (IWI), academic candidate paraphrase generation, and paraphrase
ranking components. To generate candidates and rank them in context, we have
used the PPDB and WordNet paraphrase resources. We use the Concepts in Context
(CoInCO) "All-Words" lexical substitution dataset both for the informal word
identification and paraphrase generation experiments. Our informal word
identification component achieves an F1 score of 82%, significantly
outperforming a stratified classifier baseline. The main contribution of this
work is a domain-independent methodology to build targeted resources for
writing aids.
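At its simplest, informal word identification can be approximated by a lexicon lookup against academic word lists like those mentioned above. The sketch below is a deliberately naive baseline (the word lists and stopwords are toy stand-ins), far weaker than the paper's trained IWI component:

```python
# Toy stand-ins for the academic word lists (COCA AWL, NAWL, ACL) and for
# a stopword list; the real resources contain thousands of entries.
ACADEMIC = {"analyze", "demonstrate", "significant", "derive", "method"}
STOPWORDS = {"we", "a", "an", "the", "to", "of", "out"}

def informal_words(tokens):
    """Flag content words found in neither the academic vocabulary nor the
    stopword list as candidates for academic paraphrasing."""
    return [t for t in tokens
            if t.isalpha()
            and t.lower() not in ACADEMIC
            and t.lower() not in STOPWORDS]

sentence = "We demonstrate a cool method to figure out stuff".split()
print(informal_words(sentence))  # ['cool', 'figure', 'stuff']
```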
LSBert: A Simple Framework for Lexical Simplification
Lexical simplification (LS) aims to replace complex words in a given sentence
with their simpler alternatives of equivalent meaning, to simplify the
sentence. Recent unsupervised lexical simplification approaches rely only on
the complex word itself, regardless of the given sentence, to generate
candidate substitutions, which inevitably produces a large number of spurious
candidates. In this paper, we propose LSBert, a lexical simplification
framework based on the pretrained representation model BERT, which is capable
of (1) making use of the wider context both when detecting words in need of
simplification and when generating substitute candidates, and (2) taking five
high-quality features into account for ranking candidates, including BERT
prediction order, a BERT-based language model, and the paraphrase database PPDB,
in addition to the word frequency and word similarity commonly used in other LS
methods. We show that our system outputs lexical simplifications that are
grammatically correct and semantically appropriate, and obtains a clear
improvement over the baselines, outperforming the state of the art by 29.8
accuracy points on three well-known benchmarks.
Comment: arXiv admin note: text overlap with arXiv:1907.0622
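One common way to combine several ranking features, and a plausible reading of LSBert's candidate-ranking step, is to rank the candidates under each feature separately and then average the per-feature ranks. The sketch below uses three invented features with toy scores rather than the paper's five:

```python
def rank_candidates(candidates, feature_scores):
    """Rank substitution candidates by averaging their per-feature ranks
    (lower average rank = better substitute). feature_scores maps each
    feature name to a {candidate: score} dict, higher score = better."""
    avg_rank = {}
    for cand in candidates:
        ranks = []
        for scores in feature_scores.values():
            ordered = sorted(candidates, key=lambda c: scores[c], reverse=True)
            ranks.append(ordered.index(cand))
        avg_rank[cand] = sum(ranks) / len(ranks)
    return sorted(candidates, key=lambda c: avg_rank[c])

cands = ["happy", "content", "jovial"]
features = {               # toy scores standing in for LSBert's features
    "lm_probability": {"happy": 0.9, "content": 0.6, "jovial": 0.2},
    "frequency":      {"happy": 0.95, "content": 0.7, "jovial": 0.1},
    "similarity":     {"happy": 0.8, "content": 0.75, "jovial": 0.5},
}
print(rank_candidates(cands, features))  # 'happy' ranks first
```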