4 research outputs found

    LaSTUS/TALN at Complex Word Identification (CWI) 2018 Shared Task

    No full text
    Paper presented at the 13th Workshop on Innovative Use of NLP for Building Educational Applications, held on 5 June 2018 in New Orleans, USA.

    This paper presents the participation of the LaSTUS/TALN team in the English monolingual track of the Complex Word Identification (CWI) Shared Task 2018. The purpose of the task was to determine whether a word in a given sentence would be judged as complex or not by a certain target audience. For the English track, the task organizers provided training and development datasets of 27,299 and 3,328 words respectively, together with the sentence in which each word occurs. The words were judged as complex or not by 20 human evaluators, ten of whom were native speakers. We submitted two systems: one modeled each target word as a numeric vector populated with a set of lexical, semantic, and contextual features, while the other relied on a word-embedding representation and a distance metric. We trained two separate classifiers to automatically decide whether each word is complex. We submitted six runs, two for each of the three subsets of the English monolingual CWI track.

    This work is (partly) supported by the Spanish Ministry of Economy and Competitiveness under the María de Maeztu Units of Excellence Programme (MDM-2015-0502) and by the TUNER project (TIN2015-65308-C5-5-R, MINECO/FEDER, UE).
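    As a rough illustration of the first system design described above (each target word modeled as a numeric feature vector fed to a classifier), here is a minimal Python sketch. The specific features, the toy frequency table, and the choice of RandomForestClassifier are illustrative assumptions, not the authors' actual system.

    ```python
    # Minimal sketch of a feature-vector CWI classifier in the spirit of the
    # abstract above. Features, data, and classifier are assumptions.
    from sklearn.ensemble import RandomForestClassifier

    # Hypothetical toy frequency table; a real system would use corpus counts.
    FREQ = {"cat": 1_000_000, "perambulate": 120}

    def features(word, sentence):
        """Lexical and contextual features for one target word."""
        vowels = sum(ch in "aeiou" for ch in word.lower())  # crude syllable proxy
        return [
            len(word),                  # word length (lexical)
            vowels,                     # approximate syllable count (lexical)
            FREQ.get(word.lower(), 0),  # corpus frequency, 0 if unseen
            len(sentence.split()),      # sentence length (contextual)
        ]

    # Toy training data: (word, sentence, is_complex)
    train = [
        ("cat", "The cat sat on the mat.", 0),
        ("perambulate", "We perambulate along the shore.", 1),
    ]
    X = [features(w, s) for w, s, _ in train]
    y = [label for _, _, label in train]

    clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
    print(clf.predict([features("perambulate", "They perambulate daily.")]))
    ```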

    Leveraging contextual representations with BiLSTM-based regressor for lexical complexity prediction

    Get PDF
    Lexical complexity prediction (LCP) determines the complexity level of words or phrases in a sentence. LCP has a significant impact on the enhancement of language translation, readability assessment, and text generation. However, domain-specific technical words, complex grammatical structures, polysemy, and inter-word relationships and dependencies make it challenging to determine the complexity of words or phrases. In this paper, we propose an integrated transformer regressor model named ITRM-LCP to estimate the lexical complexity of words and phrases, where diverse contextual features are extracted from various transformer models. The transformer models are fine-tuned using the text-pair data. Then, a bidirectional LSTM-based regressor module is plugged on top of each transformer to learn long-term dependencies and estimate the complexity scores. The predicted scores of each module are then aggregated to determine the final complexity score. We assess the proposed model using two benchmark datasets from shared tasks. Experimental findings demonstrate that our ITRM-LCP model obtains 10.2% and 8.2% improvements on the news and Wikipedia corpora of the CWI-2018 dataset, compared to the top-performing systems (DAT, CAMB, and TMU). Additionally, our ITRM-LCP model surpasses state-of-the-art LCP systems (DeepBlueAI, JUST-BLUE) by 1.5% and 1.34% on the single-word and multi-word LCP tasks defined in the SemEval LCP-2021 task.
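    The core pattern the abstract describes, contextual transformer embeddings fed to a BiLSTM regressor that outputs a complexity score, can be sketched in PyTorch as below. The encoder name, hidden size, and mean-pooling head are assumptions for illustration, not the paper's exact architecture; a full ITRM-LCP-style system would also aggregate the scores of several such modules, one per transformer.

    ```python
    # Minimal sketch of a BiLSTM regressor over transformer embeddings.
    # Model name, dimensions, and pooling are illustrative assumptions.
    import torch
    import torch.nn as nn
    from transformers import AutoModel, AutoTokenizer

    class BiLSTMRegressor(nn.Module):
        def __init__(self, encoder_name="bert-base-uncased", hidden=256):
            super().__init__()
            self.encoder = AutoModel.from_pretrained(encoder_name)
            dim = self.encoder.config.hidden_size
            self.bilstm = nn.LSTM(dim, hidden, batch_first=True,
                                  bidirectional=True)
            self.head = nn.Linear(2 * hidden, 1)  # complexity score in [0, 1]

        def forward(self, input_ids, attention_mask):
            states = self.encoder(input_ids=input_ids,
                                  attention_mask=attention_mask).last_hidden_state
            out, _ = self.bilstm(states)   # capture long-range dependencies
            pooled = out.mean(dim=1)       # simple mean pooling (assumed)
            return torch.sigmoid(self.head(pooled)).squeeze(-1)

    tok = AutoTokenizer.from_pretrained("bert-base-uncased")
    batch = tok(["The cacophony startled the horses."], return_tensors="pt")
    score = BiLSTMRegressor()(batch["input_ids"], batch["attention_mask"])
    print(score)  # untrained output; training would regress to gold scores
    ```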

    Predicting lexical complexity in English texts: the Complex 2.0 dataset

    Get PDF
    © 2022 The Authors. Published by Springer. This is an open access article available under a Creative Commons licence. The published version can be accessed on the publisher's website: https://doi.org/10.1007/s10579-022-09588-2

    Identifying words which may cause difficulty for a reader is an essential step in most lexical text simplification systems, prior to lexical substitution, and can also be used for assessing the readability of a text. This task is commonly referred to as complex word identification (CWI) and is often modelled as a supervised classification problem. Training such systems requires annotated datasets in which words, and sometimes multi-word expressions, are labelled for complexity. In this paper we analyze previous work carried out on this task and investigate the properties of CWI datasets for English. We develop a protocol for the annotation of lexical complexity and use it to annotate a new dataset, CompLex 2.0. We present experiments using both new and old datasets to investigate the nature of lexical complexity. We found that a Likert-scale annotation protocol provides a more objective setting for identifying the complexity of words than a binary annotation protocol. We release a new dataset using our new protocol to promote the task of lexical complexity prediction.
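    One plausible reading of such a Likert-based protocol is that per-annotator ratings are averaged and rescaled into a continuous complexity score, as in the sketch below. The 1-5 scale and the (r - 1) / 4 mapping to [0, 1] are assumptions for illustration, not necessarily the paper's exact procedure.

    ```python
    # Minimal sketch: turn per-annotator Likert ratings into one continuous
    # complexity score. Scale and rescaling formula are assumptions.
    def complexity_score(ratings):
        """Average 5-point Likert ratings and rescale to [0, 1]."""
        if not ratings:
            raise ValueError("need at least one annotation")
        return sum((r - 1) / 4 for r in ratings) / len(ratings)

    # Ten annotators rate one word in context; mostly 3s and 4s.
    print(complexity_score([3, 4, 3, 4, 4, 3, 2, 4, 3, 4]))  # 0.6
    ```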

    Lexical complexity prediction: an overview

    Get PDF
    The occurrence of unknown words in texts significantly hinders reading comprehension. To improve accessibility for specific target populations, computational modeling has been applied to identify complex words in texts and replace them with simpler alternatives. In this article, we present an overview of computational approaches to lexical complexity prediction, focusing on work carried out on English data. We survey relevant approaches to this problem, which include traditional machine learning classifiers (e.g., SVMs, logistic regression) and deep neural networks, as well as a variety of features, such as word frequency and word length and others inspired by the psycholinguistics literature. Furthermore, we introduce readers to past competitions and available datasets created on this topic. Finally, we include brief sections on applications of lexical complexity prediction, such as readability assessment and text simplification, together with related studies on languages other than English.
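    The simplification application the overview mentions follows a two-step pattern: flag complex words, then swap in simpler alternatives. A minimal sketch of that pipeline is below; the frequency threshold standing in for a trained CWI classifier, the toy frequency table, and the synonym list are all illustrative assumptions.

    ```python
    # Minimal sketch of CWI feeding lexical simplification: flag complex
    # words, then substitute simpler synonyms. All data here is toy data.
    FREQ = {"use": 900_000, "utilize": 4_000, "help": 700_000,
            "facilitate": 6_000}
    SYNONYMS = {"utilize": ["use"], "facilitate": ["help"]}

    def is_complex(word, threshold=10_000):
        """Frequency-threshold stand-in for a trained CWI classifier."""
        return FREQ.get(word.lower(), 0) < threshold

    def simplify(sentence):
        out = []
        for word in sentence.split():
            if is_complex(word) and word.lower() in SYNONYMS:
                # Pick the most frequent known synonym as the substitute.
                out.append(max(SYNONYMS[word.lower()],
                               key=lambda s: FREQ.get(s, 0)))
            else:
                out.append(word)
        return " ".join(out)

    print(simplify("we utilize tools to facilitate reading"))
    # -> "we use tools to help reading"
    ```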