    On the Role of Text Preprocessing in Neural Network Architectures: An Evaluation Study on Text Categorization and Sentiment Analysis

    Text preprocessing is often the first step in the pipeline of a Natural Language Processing (NLP) system, with potential impact on its final performance. Despite its importance, text preprocessing has not received much attention in the deep learning literature. In this paper we investigate the impact of simple text preprocessing decisions (particularly tokenizing, lemmatizing, lowercasing and multiword grouping) on the performance of a standard neural text classifier. We perform an extensive evaluation on standard benchmarks from text categorization and sentiment analysis. While our experiments show that a simple tokenization of input text is generally adequate, they also highlight significant degrees of variability across preprocessing techniques. This reveals the importance of paying attention to this usually overlooked step in the pipeline, particularly when comparing different models. Finally, our evaluation provides insights into the best preprocessing practices for training word embeddings. Comment: Blackbox EMNLP 2018. 7 pages

    From Word to Sense Embeddings: A Survey on Vector Representations of Meaning

    Over the past years, distributed semantic representations have proved to be effective and flexible keepers of prior knowledge to be integrated into downstream applications. This survey focuses on the representation of meaning. We start from the theoretical background behind word vector space models and highlight one of their major limitations: the meaning conflation deficiency, which arises from representing a word with all its possible meanings as a single vector. Then, we explain how this deficiency can be addressed through a transition from the word level to the more fine-grained level of word senses (in its broader acceptation) as a method for modelling unambiguous lexical meaning. We present a comprehensive overview of the wide range of techniques in the two main branches of sense representation, i.e., unsupervised and knowledge-based. Finally, this survey covers the main evaluation procedures and applications for this type of representation, and provides an analysis of four of its important aspects: interpretability, sense granularity, adaptability to different domains and compositionality. Comment: 46 pages, 8 figures. Published in Journal of Artificial Intelligence Research

    Action Plan to Combat Hate Crimes (Plan de acción de lucha contra los delitos de odio)

    In 2012 the project "Training for the Identification and Recording of Racist Incidents" (FIRIR) was launched. Under this programme, the Support Manual for Training Security Forces in the Identification and Recording of Racist or Xenophobic Incidents was published in 2013 as a dedicated tool for national, regional and local security forces, giving them the knowledge needed to detect and record racist and xenophobic incidents effectively. Since 2013 an annual report has been produced on the evolution of incidents related to hate crimes in Spain. The Ministry of the Interior website has a section devoted specifically to hate crimes. From March 2015 to December 2017 the "Survey on experiences with incidents related to hate crimes" was carried out in order to improve the care received by victims of hate crimes. General Order 2285 of 12 February 2018 created the National Office for Combating Hate Crimes (Oficina Nacional de Lucha contra los Delitos de Odio). It is part of the Cabinet of Coordination and Studies of the SES (Statistical System and Victim Care Area) and is made up of members of the State Security Forces and Corps. Its objectives are to train, to investigate, to establish relations between institutions and the third sector, and to centralize the relevant data gathered by the FFCCSE. Instruction 17/2017 paved the way for drafting the Protocol of action for the Security Forces and Corps on hate crimes and conduct that violates the legal rules on discrimination. More recently, this same year, the Ministry of the Interior presented an ambitious Action Plan to Combat Hate Crimes, which describes the strategy for the coming years for combating these crimes. Universidad de Málaga. Campus de Excelencia Internacional Andalucía Tech

    A Unified multilingual semantic representation of concepts

    Semantic representation lies at the core of several applications in Natural Language Processing. However, most existing semantic representation techniques cannot be used effectively for the representation of individual word senses. We put forward a novel multilingual concept representation, called MUFFIN, which not only enables accurate representation of word senses in different languages, but also provides multiple advantages over existing approaches. MUFFIN represents a given concept in a unified semantic space irrespective of the language of interest, enabling cross-lingual comparison of different concepts. We evaluate our approach on two different evaluation benchmarks, semantic similarity and Word Sense Disambiguation, reporting state-of-the-art performance on several standard datasets.

    NASARI: a novel approach to a Semantically-Aware Representation of items

    The semantic representation of individual word senses and concepts is of fundamental importance to several applications in Natural Language Processing. To date, concept modeling techniques have in the main based their representation either on lexicographic resources, such as WordNet, or on encyclopedic resources, such as Wikipedia. We propose a vector representation technique that combines the complementary knowledge of both these types of resource. Thanks to its use of explicit semantics combined with a novel cluster-based dimensionality reduction and an effective weighting scheme, our representation attains state-of-the-art performance on multiple datasets in two standard benchmarks: word similarity and sense clustering. We are releasing our vector representations at http://lcl.uniroma1.it/nasari/

    Embedding Words and Senses Together via Joint Knowledge-Enhanced Training

    Word embeddings are widely used in Natural Language Processing, mainly due to their success in capturing semantic information from massive corpora. However, their creation process does not allow the different meanings of a word to be automatically separated, as it conflates them into a single vector. We address this issue by proposing a new model which learns word and sense embeddings jointly. Our model exploits large corpora and knowledge from semantic networks in order to produce a unified vector space of word and sense embeddings. We evaluate the main features of our approach both qualitatively and quantitatively in a variety of tasks, highlighting the advantages of the proposed method in comparison to state-of-the-art word- and sense-based models.

    Robust Hate Speech Detection in Social Media: A Cross-Dataset Empirical Evaluation

    The automatic detection of hate speech online is an active research area in NLP. Most of the studies to date are based on social media datasets that contribute to the creation of hate speech detection models trained on them. However, data creation processes contain their own biases, and models inherently learn from these dataset-specific biases. In this paper, we perform a large-scale cross-dataset comparison where we fine-tune language models on different hate speech detection datasets. This analysis shows how some datasets are more generalisable than others when used as training data. Crucially, our experiments show how combining hate speech detection datasets can contribute to the development of robust hate speech detection models. This robustness holds even when controlling for data size and when compared with the best individual datasets. Comment: Accepted in "Workshop on Online Abuse and Harms (WOAH)", 202