6 research outputs found

    Lexical Simplification for Non-Native English Speakers

    Lexical Simplification is the process of replacing complex words in texts with simpler, more easily comprehensible alternatives. It has proven very useful as an assistive tool for users who may find complex texts challenging. People with aphasia or dyslexia are among the most common beneficiaries of such technology. In this thesis we focus on Lexical Simplification for English, taking non-native English speakers as the target audience. Even though they number in the hundreds of millions, there are very few contributions that aim to address the needs of these users. Current work is unable to provide solutions for this audience due to a lack of user studies, datasets and resources. Furthermore, existing work in Lexical Simplification is limited regardless of the target audience, as it tends to focus on certain steps of the simplification process and disregard others, such as the automatic detection of the words that require simplification.

    We introduce a series of contributions to the area of Lexical Simplification that range from user studies and resulting datasets to novel methods for all steps of the process and evaluation techniques. In order to understand the needs of non-native English speakers, we conducted three user studies with 1,000 users in total. These studies demonstrated that the number of words deemed complex by non-native speakers of English correlates with their level of English proficiency and appears to decrease with age. They also indicated that although words deemed complex tend to be much less ambiguous and less frequently found in corpora, the complexity of words also depends on the context in which they occur. Based on these findings, we propose an ensemble approach which achieves state-of-the-art performance in identifying words that challenge non-native speakers of English.

    Using the insight and data gathered, we created two new approaches to Lexical Simplification that address the needs of non-native English speakers: joint and pipelined. The joint approach employs resource-light neural language models to simplify words deemed complex in a single step. While its performance was unsatisfactory, it proved useful when paired with pipelined approaches. Our pipelined simplifier generates candidate replacements for complex words using new, context-aware word embedding models, filters them for grammaticality and meaning preservation using a novel unsupervised ranking approach, and finally ranks them for simplicity using a novel supervised ranker that learns a model based on the needs of non-native English speakers.

    In order to test these and previous approaches, we designed LEXenstein, a framework for Lexical Simplification, and compiled NNSeval, a dataset that accounts for the needs of non-native English speakers. Comparisons against hundreds of previous approaches, as well as the variants we proposed, showed that our pipelined approach outperforms all others. Finally, we introduce PLUMBErr, a new automatic error identification framework for Lexical Simplification. Using this framework, we assessed the type and number of errors made by our pipelined approach throughout the simplification process and found that combining our ensemble complex word identifier with our pipelined simplifier yields a system that makes up to 25% fewer mistakes than previous state-of-the-art strategies during the simplification process.
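
    The pipeline described in this abstract follows the classic Lexical Simplification steps: Complex Word Identification, Substitution Generation, Substitution Selection, and Substitution Ranking. Below is a minimal, self-contained Python sketch of that flow only; the frequency table, substitution lexicon, and threshold are toy placeholders, not the thesis's ensemble identifier, embedding-based generators, or learned rankers.

```python
# Toy sketch of the four Lexical Simplification steps. All resources here
# (frequencies, candidate lists, threshold) are invented for illustration.

# Toy corpus frequencies: rarer words are treated as more complex.
WORD_FREQ = {
    "help": 9000, "assist": 1200, "facilitate": 150,
    "use": 9500, "utilise": 400, "employ": 800,
}

# Toy substitution lexicon standing in for embedding-based generation.
CANDIDATES = {
    "facilitate": ["help", "assist", "ease"],
    "utilise": ["use", "employ"],
}

COMPLEXITY_THRESHOLD = 1000  # words rarer than this are flagged as complex


def is_complex(word: str) -> bool:
    """Complex Word Identification: flag words below the frequency threshold."""
    return WORD_FREQ.get(word, 0) < COMPLEXITY_THRESHOLD


def generate(word: str) -> list[str]:
    """Substitution Generation: look up candidate replacements."""
    return CANDIDATES.get(word, [])


def select(cands: list[str]) -> list[str]:
    """Substitution Selection: keep only candidates we have evidence for
    (here, simply presence in the frequency table)."""
    return [c for c in cands if c in WORD_FREQ]


def rank(cands: list[str]) -> list[str]:
    """Substitution Ranking: most frequent (simplest) first."""
    return sorted(cands, key=lambda c: -WORD_FREQ[c])


def simplify(sentence: str) -> str:
    out = []
    for token in sentence.split():
        if is_complex(token):
            ranked = rank(select(generate(token)))
            out.append(ranked[0] if ranked else token)
        else:
            out.append(token)
    return " ".join(out)


print(simplify("we utilise tools to facilitate work"))
# -> "we use tools to help work"
```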

    Collecting and Exploring Everyday Language for Predicting Psycholinguistic Properties of Words

    Conference paper: Collecting and Exploring Everyday Language for Predicting Psycholinguistic Properties of Words

    Can lies be faked? Comparing low-stakes and high-stakes deception video datasets from a Machine Learning perspective

    Despite the great impact of lies on human societies and a meager 54% human accuracy for Deception Detection (DD), Machine Learning systems that perform automated DD are still not viable for real-life settings due to data scarcity. Few publicly available DD datasets exist, and the creation of new datasets is hindered by the conceptual distinction between low-stakes and high-stakes lies. Theoretically, the two kinds of lies are so distinct that a dataset of one kind could not be used for applications targeting the other. Data on low-stakes deception is easier to acquire, since such lies can be simulated (faked) in controlled settings, but these lies do not carry the same significance or depth as genuine high-stakes lies, which are much harder to obtain and are the ones of practical interest to automated DD systems. To investigate whether this distinction holds in practice, we design several experiments comparing a high-stakes and a low-stakes DD dataset, evaluating both with a Deep Learning classifier that works exclusively from video data. In our experiments, a network trained on low-stakes lies classified high-stakes deception more accurately than low-stakes deception, although using low-stakes lies as an augmentation strategy for the high-stakes dataset decreased its accuracy.
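
    The core of this experimental design is a cross-dataset protocol: train on one kind of lie, test on the other, and additionally try low-stakes data as augmentation for the scarce high-stakes set. The sketch below illustrates that protocol only; random placeholder vectors stand in for video-derived features, and a logistic regression stands in for the paper's Deep Learning video classifier.

```python
# Cross-stakes evaluation sketch: all data here is synthetic, and the
# simple linear model is a stand-in for a deep video classifier.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

def fake_dataset(n: int, shift: float):
    """Placeholder for a DD dataset: feature vectors + truthful/deceptive labels."""
    y = rng.integers(0, 2, n)
    X = rng.normal(0, 1, (n, 16)) + shift * y[:, None]
    return X, y

X_low, y_low = fake_dataset(400, shift=0.8)    # low-stakes (easy to collect)
X_high, y_high = fake_dataset(120, shift=0.5)  # high-stakes (scarce)
X_test, y_test = fake_dataset(80, shift=0.5)   # held-out high-stakes test set

def evaluate(X_train, y_train, name):
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    acc = accuracy_score(y_test, clf.predict(X_test))
    print(f"{name}: {acc:.2f} accuracy on high-stakes test data")

evaluate(X_low, y_low, "trained on low-stakes only")
evaluate(X_high, y_high, "trained on high-stakes only")
evaluate(np.vstack([X_high, X_low]), np.concatenate([y_high, y_low]),
         "high-stakes augmented with low-stakes")
```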

    SemEval-2021 Task 1: Lexical Complexity Prediction

    © 2021 The Authors. Published by ACL. This is an open access article available under a Creative Commons licence; the published version can be accessed on the publisher's website: https://aclanthology.org/2021.semeval-1.1

    This paper presents the results and main findings of SemEval-2021 Task 1: Lexical Complexity Prediction. We provided participants with an augmented version of the CompLex corpus (Shardlow et al., 2020). CompLex is an English multi-domain corpus in which words and multi-word expressions (MWEs) were annotated with respect to their complexity using a five-point Likert scale. SemEval-2021 Task 1 featured two sub-tasks: Sub-task 1 focused on single words and Sub-task 2 on MWEs. The competition attracted 198 teams in total, of which 54 submitted official runs on the test data for Sub-task 1 and 37 for Sub-task 2.
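
    As a rough illustration of the task setup, the sketch below trains a regressor to predict a continuous complexity score from simple word features, with mean 1-5 Likert ratings normalised onto [0, 1] (one common way to score such annotations). The tiny training table and the two features are invented placeholders, not the CompLex data or any participant's system.

```python
# Lexical Complexity Prediction sketch with toy data and toy features.
import numpy as np
from sklearn.linear_model import Ridge

# (word, toy log-frequency, mean Likert rating on the 1-5 scale)
TRAIN = [
    ("use", 9.2, 1.1), ("help", 8.9, 1.2), ("obtain", 6.5, 2.3),
    ("facilitate", 4.8, 3.4), ("perfunctory", 2.1, 4.6),
]

def features(word: str, log_freq: float) -> list[float]:
    """Two illustrative features: word length and a toy log frequency."""
    return [len(word), log_freq]

# Normalise the 1-5 Likert mean onto [0, 1]: 1 -> 0.0, 5 -> 1.0.
X = np.array([features(w, f) for w, f, _ in TRAIN])
y = np.array([(r - 1) / 4 for _, _, r in TRAIN])

model = Ridge(alpha=1.0).fit(X, y)
pred = model.predict(np.array([features("ameliorate", 3.0)]))[0]
print(f"predicted complexity of 'ameliorate': {pred:.2f}")  # 0 = simplest
```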

    Quality Estimation for Machine Translation

    No full text available.