78 research outputs found

    Controllable Text Simplification with Explicit Paraphrasing

    Text Simplification improves the readability of sentences through several rewriting transformations, such as lexical paraphrasing, deletion, and splitting. Current simplification systems are predominantly sequence-to-sequence models that are trained end-to-end to perform all these operations simultaneously. However, such systems limit themselves to mostly deleting words and cannot easily adapt to the requirements of different target audiences. In this paper, we propose a novel hybrid approach that leverages linguistically-motivated rules for splitting and deletion, and couples them with a neural paraphrasing model to produce varied rewriting styles. We introduce a new data augmentation method to improve the paraphrasing capability of our model. Through automatic and manual evaluations, we show that our proposed model establishes a new state-of-the-art for the task, paraphrasing more often than the existing systems, and can control the degree of each simplification operation applied to the input texts.

    Lexical simplification for the systematic support of cognitive accessibility guidelines

    The Internet has come a long way in recent years, contributing to the proliferation of large volumes of digitally available information. We can access this content through user interfaces; however, it is not accessible to everyone. The users mainly affected are people with disabilities, who already form a considerable group, but accessibility barriers affect a wide range of user groups and contexts of use in accessing digital information. Some of these barriers are caused by language inaccessibility when texts contain long sentences, unusual words and complex linguistic structures. These accessibility barriers directly affect people with cognitive disabilities. For the purpose of making textual content more accessible, there are initiatives such as the Easy Reading guidelines, the Plain Language guidelines and some of the Web Content Accessibility Guidelines (WCAG) that are specific to language. These guidelines provide documentation, but do not specify methods for meeting their implicit requirements in a systematic way. To address this, methods from the Natural Language Processing (NLP) discipline can support compliance with the cognitive accessibility guidelines for language. The task of text simplification aims at reducing the linguistic complexity of a text from a syntactic and lexical perspective, the latter being the main focus of this Thesis. In this sense, one solution is to identify which words in a text are complex or uncommon and, where such words are found, to provide a more common and simpler synonym together with a simple definition, all oriented to people with cognitive disabilities. With this goal in mind, this Thesis presents the study, analysis, design and development of an architecture, NLP methods, resources and tools for the lexical simplification of Spanish texts in a generic domain in the field of cognitive accessibility.
To achieve this, each of the steps present in the lexical simplification processes is studied, together with methods for word sense disambiguation. As a contribution, different types of word embedding are explored and created, supported by traditional and dynamic embedding methods, such as transfer learning methods. In addition, since most of the NLP methods require data for their operation, a resource in the framework of cognitive accessibility is presented as a contribution.
Doctoral Programme in Computer Science and Technology, Universidad Carlos III de Madrid. Chair: José Antonio Macías Iglesias. Secretary: Israel González Carrasco. Member: Raquel Hervás Ballestero.

    Lexical Simplification for Non-Native English Speakers

    Lexical Simplification is the process of replacing complex words in texts to create simpler, more easily comprehensible alternatives. It has proven very useful as an assistive tool for users who may find complex texts challenging. Those who suffer from Aphasia and Dyslexia are among the most common beneficiaries of such technology. In this thesis we focus on Lexical Simplification for English using non-native English speakers as the target audience. Even though they number in hundreds of millions, there are very few contributions that aim to address the needs of these users. Current work is unable to provide solutions for this audience due to lack of user studies, datasets and resources. Furthermore, existing work in Lexical Simplification is limited regardless of the target audience, as it tends to focus on certain steps of the simplification process and disregard others, such as the automatic detection of the words that require simplification. We introduce a series of contributions to the area of Lexical Simplification that range from user studies and resulting datasets to novel methods for all steps of the process and evaluation techniques. In order to understand the needs of non-native English speakers, we conducted three user studies with 1,000 users in total. These studies demonstrated that the number of words deemed complex by non-native speakers of English correlates with their level of English proficiency and appears to decrease with age. They also indicated that although words deemed complex tend to be much less ambiguous and less frequently found in corpora, the complexity of words also depends on the context in which they occur. Based on these findings, we propose an ensemble approach which achieves state-of-the-art performance in identifying words that challenge non-native speakers of English. 
Using the insight and data gathered, we created two new approaches to Lexical Simplification that address the needs of non-native English speakers: joint and pipelined. The joint approach employs resource-light neural language models to simplify words deemed complex in a single step. While its performance was unsatisfactory, it proved useful when paired with pipelined approaches. Our pipelined simplifier generates candidate replacements for complex words using new, context-aware word embedding models, filters them for grammaticality and meaning preservation using a novel unsupervised ranking approach, and finally ranks them for simplicity using a novel supervised ranker that learns a model based on the needs of non-native English speakers. In order to test these and previous approaches, we designed LEXenstein, a framework for Lexical Simplification, and compiled NNSeval, a dataset that accounts for the needs of non-native English speakers. Comparisons against hundreds of previous approaches as well as the variants we proposed showed that our pipelined approach outperforms all others. Finally, we introduce PLUMBErr, a new automatic error identification framework for Lexical Simplification. Using this framework, we assessed the type and number of errors made by our pipelined approach throughout the simplification process and found that combining our ensemble complex word identifier with our pipelined simplifier yields a system that makes up to 25% fewer mistakes compared to the previous state-of-the-art strategies during the simplification process.
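
The pipelined steps named in this abstract (identify complex words, generate candidates, filter them, rank them for simplicity) can be illustrated with a toy sketch. This is not the thesis's implementation and does not use LEXenstein: the frequency table, the synonym list, the threshold and every function name below are hypothetical stand-ins. A frequency threshold replaces the ensemble identifier, a dictionary lookup replaces the embedding-based generator, and a frequency sort replaces the supervised ranker.

```python
from typing import Dict, List

# Toy resources. All counts and synonyms are invented for illustration only.
WORD_FREQ: Dict[str, int] = {
    "the": 1000, "doctor": 300, "will": 800, "patient": 200,
    "check": 250, "examine": 80, "inspect": 40, "scrutinize": 5,
}
SYNONYMS: Dict[str, List[str]] = {
    "scrutinize": ["examine", "inspect", "check"],
}


def is_complex(word: str, threshold: int = 50) -> bool:
    """Step 1 (assumed heuristic): flag words rarer than a frequency threshold."""
    return WORD_FREQ.get(word.lower(), 0) < threshold


def generate_candidates(word: str) -> List[str]:
    """Step 2: look up replacement candidates (stand-in for an embedding model)."""
    return SYNONYMS.get(word.lower(), [])


def filter_candidates(candidates: List[str]) -> List[str]:
    """Step 3: placeholder filter that keeps only candidates in the vocabulary."""
    return [c for c in candidates if c in WORD_FREQ]


def rank_by_simplicity(candidates: List[str]) -> List[str]:
    """Step 4: rank candidates, most frequent (assumed simplest) first."""
    return sorted(candidates, key=lambda c: WORD_FREQ[c], reverse=True)


def simplify(sentence: str) -> str:
    """Run the four steps on each whitespace-separated token."""
    output = []
    for token in sentence.split():
        if is_complex(token):
            ranked = rank_by_simplicity(filter_candidates(generate_candidates(token)))
            # Keep the original token when no candidate survives.
            output.append(ranked[0] if ranked else token)
        else:
            output.append(token)
    return " ".join(output)


if __name__ == "__main__":
    print(simplify("the doctor will scrutinize the patient"))
```

Keeping each step as a separate function mirrors the pipelined design described above: any single stage can be swapped for a stronger model without touching the others.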