50 research outputs found

    On Methods of Data Standardization of German Social Media Comments

    This article is part of a larger project that aims to identify discursive strategies in social media discourse on gender diversity, for which roughly 350,000 comments were scraped from the comment sections of YouTube videos on the topic. The article focuses on different methods of standardizing social media data to improve further processing. More specifically, the data are corrected in terms of casing, spelling, and punctuation. Different tools and models (LanguageTool, T5, seq2seq, GPT-2) were tested. The best outcome was achieved by the German GPT-2 model: it scored highest on all of the applied metrics (ROUGE, GLEU, BLEU), making it the best model for the task of Grammatical Error Correction in German social media data.
    Melnyk, L.; Feld, L. (2023). On Methods of Data Standardization of German Social Media Comments. Journal of Computer-Assisted Linguistic Research. 7:22-42. https://doi.org/10.4995/jclr.2023.19907

    Generating Grammar Exercises

    Grammar exercises for language learning fall into two distinct classes: those that are based on "real life sentences" extracted from existing documents or from the web; and those that seek to facilitate language acquisition by presenting the learner with exercises whose syntax is as simple as possible and whose vocabulary is restricted to that contained in the textbook being used. In this paper, we introduce a framework (called gramex) which permits generating the second type of grammar exercises. Using generation techniques, we show that a grammar can be used to semi-automatically generate grammar exercises which target a specific learning goal; are made of short, simple sentences; and whose vocabulary is restricted to that used in a given textbook.

    GECTurk: Grammatical Error Correction and Detection Dataset for Turkish

    Grammatical Error Detection and Correction (GEC) tools have proven useful for native speakers and second language learners. Developing such tools requires a large amount of parallel, annotated data, which is unavailable for most languages. Synthetic data generation is a common practice to overcome the scarcity of such data. However, it is not straightforward for morphologically rich languages like Turkish due to complex writing rules that require phonological, morphological, and syntactic information. In this work, we present a flexible and extensible synthetic data generation pipeline for Turkish covering more than 20 expert-curated grammar and spelling rules (a.k.a. writing rules) implemented through complex transformation functions. Using this pipeline, we derive 130,000 high-quality parallel sentences from professionally edited articles. Additionally, we create a more realistic test set by manually annotating a set of movie reviews. We implement three baselines formulating the task as i) neural machine translation, ii) sequence tagging, and iii) prefix tuning with a pretrained decoder-only model, achieving strong results. Furthermore, we perform exhaustive experiments on out-of-domain datasets to gain insights on the transferability and robustness of the proposed approaches. Our results suggest that our corpus, GECTurk, is high-quality and allows knowledge transfer for the out-of-domain setting. To encourage further research on Turkish GEC, we release our datasets, baseline models, and the synthetic data generation pipeline at https://github.com/GGLAB-KU/gecturk.
    Comment: Accepted at Findings of IJCNLP-AACL 202

    Lexical simplification for the systematic support of cognitive accessibility guidelines

    The Internet has come a long way in recent years, contributing to the proliferation of large volumes of digitally available information. Through user interfaces we can access these contents; however, they are not accessible to everyone. The main users affected are people with disabilities, who are already a considerable number, but accessibility barriers affect a wide range of user groups and contexts of use in accessing digital information. Some of these barriers are caused by language inaccessibility when texts contain long sentences, unusual words and complex linguistic structures. These accessibility barriers directly affect people with cognitive disabilities. For the purpose of making textual content more accessible, there are initiatives such as the Easy Reading guidelines, the Plain Language guidelines and some of the language-specific Web Content Accessibility Guidelines (WCAG). These guidelines provide documentation, but do not specify methods for meeting the requirements implicit in these guidelines in a systematic way. To obtain a solution, methods from the Natural Language Processing (NLP) discipline can provide support for achieving compliance with the cognitive accessibility guidelines for the language. The task of text simplification aims at reducing the linguistic complexity of a text from a syntactic and lexical perspective, the latter being the main focus of this Thesis. In this sense, one solution space is to identify which words in a text are complex or uncommon and, if there are any, to provide a more common and simpler synonym, together with a simple definition, all oriented to people with cognitive disabilities. With this goal in mind, this Thesis presents the study, analysis, design and development of an architecture, NLP methods, resources and tools for the lexical simplification of texts for the Spanish language in a generic domain in the field of cognitive accessibility.
    To achieve this, each of the steps present in the lexical simplification processes is studied, together with methods for word sense disambiguation. As a contribution, different types of word embedding are explored and created, supported by traditional and dynamic embedding methods, such as transfer learning methods. In addition, since most of the NLP methods require data for their operation, a resource in the framework of cognitive accessibility is presented as a contribution.
    Doctoral Program in Computer Science and Technology, Universidad Carlos III de Madrid. Chair: José Antonio Macías Iglesias. Secretary: Israel González Carrasco. Member: Raquel Hervás Ballestero

    Towards standardizing Korean Grammatical Error Correction: Datasets and Annotation

    Research on Korean grammatical error correction (GEC) is limited compared to other major languages such as English and Chinese. We attribute this problematic circumstance to the lack of a carefully designed evaluation benchmark for Korean. Thus, in this work, we first collect three datasets from different sources (Kor-Lang8, Kor-Native, and Kor-Learner) to cover a wide range of error types and annotate them using our newly proposed tool called Korean Automatic Grammatical error Annotation System (KAGAS). KAGAS is a carefully designed edit alignment and classification tool that considers the nature of Korean when generating an alignment between a source sentence and a target sentence, and identifies the error type of each aligned edit. We also present baseline models fine-tuned over our datasets. We show that the model trained with our datasets significantly outperforms the public statistical GEC system (Hanspell) on a wider range of error types, demonstrating the diversity and usefulness of the datasets.
    Comment: Add affiliation and email addres

    The Best Explanation: Beyond Right and Wrong in Question Answering


    Proceedings of the Fifth Italian Conference on Computational Linguistics CLiC-it 2018 : 10-12 December 2018, Torino

    On behalf of the Program Committee, a very warm welcome to the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018). This edition of the conference is held in Torino. The conference is locally organised by the University of Torino and hosted in its prestigious main lecture hall "Cavallerizza Reale". The CLiC-it conference series is an initiative of the Italian Association for Computational Linguistics (AILC) which, after five years of activity, has clearly established itself as the premier national forum for research and development in the fields of Computational Linguistics and Natural Language Processing, where leading researchers and practitioners from academia and industry meet to share their research results, experiences, and challenges.