4 research outputs found

    LaSTUS/TALN at Complex Word Identification (CWI) 2018 Shared Task

    No full text
    Paper presented at the 13th Workshop on Innovative Use of NLP for Building Educational Applications, held on 5 June 2018 in New Orleans, USA.

    This paper presents the participation of the LaSTUS/TALN team in the English monolingual track of the Complex Word Identification (CWI) Shared Task 2018. The purpose of the task was to determine whether a word in a given sentence would be judged as complex or not by a certain target audience. For the English track, the task organizers provided training and development datasets of 27,299 and 3,328 words respectively, together with the sentence in which each word occurs. The words were judged as complex or not by 20 human evaluators, ten of whom were native speakers. We submitted two systems: one modeled each target word as a numeric vector populated with a set of lexical, semantic, and contextual features, while the other relied on a word-embedding representation and a distance metric. We trained two separate classifiers to automatically decide whether each word is complex. We submitted six runs, two for each of the three subsets of the English monolingual CWI track.

    This work is (partly) supported by the Spanish Ministry of Economy and Competitiveness under the María de Maeztu Units of Excellence Programme (MDM-2015-0502) and by the TUNER project (TIN2015-65308-C5-5-R, MINECO/FEDER, UE).
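    As a rough illustration of the first system design described above (each target word modeled as a numeric feature vector fed to a classifier), here is a minimal Python sketch. The specific features, the toy frequency table, and the choice of RandomForestClassifier are illustrative assumptions, not the authors' actual system.

    ```python
    # Minimal sketch of a feature-vector CWI classifier in the spirit of the
    # abstract above. Features, data, and classifier are assumptions.
    from sklearn.ensemble import RandomForestClassifier

    # Hypothetical toy frequency table; a real system would use corpus counts.
    FREQ = {"cat": 1_000_000, "perambulate": 120}

    def features(word, sentence):
        """Lexical and contextual features for one target word."""
        vowels = sum(ch in "aeiou" for ch in word.lower())  # crude syllable proxy
        return [
            len(word),                  # word length (lexical)
            vowels,                     # approximate syllable count (lexical)
            FREQ.get(word.lower(), 0),  # corpus frequency, 0 if unseen
            len(sentence.split()),      # sentence length (contextual)
        ]

    # Toy training data: (word, sentence, is_complex)
    train = [
        ("cat", "The cat sat on the mat.", 0),
        ("perambulate", "We perambulate along the shore.", 1),
    ]
    X = [features(w, s) for w, s, _ in train]
    y = [label for _, _, label in train]

    clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
    print(clf.predict([features("perambulate", "They perambulate daily.")]))
    ```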

    Leveraging contextual representations with BiLSTM-based regressor for lexical complexity prediction

    Get PDF
    Lexical complexity prediction (LCP) determines the complexity level of words or phrases in a sentence. LCP has a significant impact on the enhancement of language translation, readability assessment, and text generation. However, domain-specific technical words, complex grammatical structures, polysemy, and inter-word relationships and dependencies make it challenging to determine the complexity of words or phrases. In this paper, we propose an integrated transformer regressor model named ITRM-LCP to estimate the lexical complexity of words and phrases, where diverse contextual features are extracted from various transformer models. The transformer models are fine-tuned using the text-pair data. Then, a bidirectional LSTM-based regressor module is plugged on top of each transformer to learn long-term dependencies and estimate the complexity scores. The predicted scores of each module are then aggregated to determine the final complexity score. We assess the proposed model using two benchmark datasets from shared tasks. Experimental findings demonstrate that our ITRM-LCP model obtains 10.2% and 8.2% improvements on the news and Wikipedia corpora of the CWI-2018 dataset, compared to the top-performing systems (DAT, CAMB, and TMU). Additionally, our ITRM-LCP model surpasses state-of-the-art LCP systems (DeepBlueAI, JUST-BLUE) by 1.5% and 1.34% on the single-word and multi-word LCP tasks defined in the SemEval LCP-2021 task.
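    The core pattern the abstract describes, contextual transformer embeddings fed to a BiLSTM regressor that outputs a complexity score, can be sketched in PyTorch as below. The encoder name, hidden size, and mean-pooling head are assumptions for illustration, not the paper's exact architecture; a full ITRM-LCP-style system would also aggregate the scores of several such modules, one per transformer.

    ```python
    # Minimal sketch of a BiLSTM regressor over transformer embeddings.
    # Model name, dimensions, and pooling are illustrative assumptions.
    import torch
    import torch.nn as nn
    from transformers import AutoModel, AutoTokenizer

    class BiLSTMRegressor(nn.Module):
        def __init__(self, encoder_name="bert-base-uncased", hidden=256):
            super().__init__()
            self.encoder = AutoModel.from_pretrained(encoder_name)
            dim = self.encoder.config.hidden_size
            self.bilstm = nn.LSTM(dim, hidden, batch_first=True,
                                  bidirectional=True)
            self.head = nn.Linear(2 * hidden, 1)  # complexity score in [0, 1]

        def forward(self, input_ids, attention_mask):
            states = self.encoder(input_ids=input_ids,
                                  attention_mask=attention_mask).last_hidden_state
            out, _ = self.bilstm(states)   # capture long-range dependencies
            pooled = out.mean(dim=1)       # simple mean pooling (assumed)
            return torch.sigmoid(self.head(pooled)).squeeze(-1)

    tok = AutoTokenizer.from_pretrained("bert-base-uncased")
    batch = tok(["The cacophony startled the horses."], return_tensors="pt")
    score = BiLSTMRegressor()(batch["input_ids"], batch["attention_mask"])
    print(score)  # untrained output; training would regress to gold scores
    ```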

    Predicting lexical complexity in English texts: the Complex 2.0 dataset

    Get PDF
    © 2022 The Authors. Published by Springer. This is an open access article available under a Creative Commons licence. The published version can be accessed on the publisher's website: https://doi.org/10.1007/s10579-022-09588-2

    Identifying words which may cause difficulty for a reader is an essential step in most lexical text simplification systems, prior to lexical substitution, and can also be used for assessing the readability of a text. This task is commonly referred to as complex word identification (CWI) and is often modelled as a supervised classification problem. Training such systems requires annotated datasets in which words, and sometimes multi-word expressions, are labelled for complexity. In this paper we analyze previous work carried out on this task and investigate the properties of CWI datasets for English. We develop a protocol for the annotation of lexical complexity and use it to annotate a new dataset, CompLex 2.0. We present experiments using both new and old datasets to investigate the nature of lexical complexity. We found that a Likert-scale annotation protocol provides a more objective setting for identifying the complexity of words than a binary annotation protocol. We release a new dataset using our new protocol to promote the task of lexical complexity prediction.
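    One plausible reading of such a Likert-based protocol is that per-annotator ratings are averaged and rescaled into a continuous complexity score, as in the sketch below. The 1-5 scale and the (r - 1) / 4 mapping to [0, 1] are assumptions for illustration, not necessarily the paper's exact procedure.

    ```python
    # Minimal sketch: turn per-annotator Likert ratings into one continuous
    # complexity score. Scale and rescaling formula are assumptions.
    def complexity_score(ratings):
        """Average 5-point Likert ratings and rescale to [0, 1]."""
        if not ratings:
            raise ValueError("need at least one annotation")
        return sum((r - 1) / 4 for r in ratings) / len(ratings)

    # Ten annotators rate one word in context; mostly 3s and 4s.
    print(complexity_score([3, 4, 3, 4, 4, 3, 2, 4, 3, 4]))  # 0.6
    ```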

    Lexical complexity prediction: an overview

    Get PDF
    The occurrence of unknown words in texts significantly hinders reading comprehension. To improve accessibility for specific target populations, computational modeling has been applied to identify complex words in texts and replace them with simpler alternatives. In this article, we present an overview of computational approaches to lexical complexity prediction, focusing on work carried out on English data. We survey relevant approaches to this problem, which include traditional machine learning classifiers (e.g., SVMs, logistic regression) and deep neural networks, as well as a variety of features, such as word frequency and word length and others inspired by the psycholinguistics literature. Furthermore, we introduce readers to past competitions and available datasets created on this topic. Finally, we include brief sections on applications of lexical complexity prediction, such as readability assessment and text simplification, together with related studies on languages other than English.
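    The simplification application the overview mentions follows a two-step pattern: flag complex words, then swap in simpler alternatives. A minimal sketch of that pipeline is below; the frequency threshold standing in for a trained CWI classifier, the toy frequency table, and the synonym list are all illustrative assumptions.

    ```python
    # Minimal sketch of CWI feeding lexical simplification: flag complex
    # words, then substitute simpler synonyms. All data here is toy data.
    FREQ = {"use": 900_000, "utilize": 4_000, "help": 700_000,
            "facilitate": 6_000}
    SYNONYMS = {"utilize": ["use"], "facilitate": ["help"]}

    def is_complex(word, threshold=10_000):
        """Frequency-threshold stand-in for a trained CWI classifier."""
        return FREQ.get(word.lower(), 0) < threshold

    def simplify(sentence):
        out = []
        for word in sentence.split():
            if is_complex(word) and word.lower() in SYNONYMS:
                # Pick the most frequent known synonym as the substitute.
                out.append(max(SYNONYMS[word.lower()],
                               key=lambda s: FREQ.get(s, 0)))
            else:
                out.append(word)
        return " ".join(out)

    print(simplify("we utilize tools to facilitate reading"))
    # -> "we use tools to help reading"
    ```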