134 research outputs found
On the Role of Text Preprocessing in Neural Network Architectures: An Evaluation Study on Text Categorization and Sentiment Analysis
Text preprocessing is often the first step in the pipeline of a Natural
Language Processing (NLP) system, with potential impact on its final
performance. Despite its importance, text preprocessing has not received much
attention in the deep learning literature. In this paper we investigate the
impact of simple text preprocessing decisions (particularly tokenizing,
lemmatizing, lowercasing and multiword grouping) on the performance of a
standard neural text classifier. We perform an extensive evaluation on standard
benchmarks from text categorization and sentiment analysis. While our
experiments show that a simple tokenization of input text is generally
adequate, they also highlight significant degrees of variability across
preprocessing techniques. This reveals the importance of paying attention to
this usually-overlooked step in the pipeline, particularly when comparing
different models. Finally, our evaluation provides insights into the best
preprocessing practices for training word embeddings.
Comment: Blackbox EMNLP 2018. 7 pages
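The preprocessing decisions named in this abstract can be made concrete with a small sketch. What follows is only an illustration under stated assumptions, not the authors' pipeline: the regex tokenizer, the `MULTIWORDS` list, and the function names are all hypothetical, and lemmatizing is left out because it would need an external lexical resource.

```python
import re

# Hypothetical multiword expressions, chosen only for illustration.
MULTIWORDS = {("new", "york"), ("machine", "learning")}


def tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


def lowercase(tokens):
    """Lowercase every token."""
    return [t.lower() for t in tokens]


def group_multiwords(tokens, multiwords=MULTIWORDS):
    """Join adjacent token pairs listed in `multiwords` (case-insensitive match)."""
    out, i = [], 0
    while i < len(tokens):
        pair = tuple(t.lower() for t in tokens[i:i + 2])
        if len(pair) == 2 and pair in multiwords:
            out.append("_".join(tokens[i:i + 2]))
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def preprocess(text, lower=False, multiword=False):
    """Tokenize, then optionally lowercase and group multiword expressions."""
    tokens = tokenize(text)
    if lower:
        tokens = lowercase(tokens)
    if multiword:
        tokens = group_multiwords(tokens)
    return tokens
```

Calling `preprocess` with different flag combinations on the same corpus yields the kind of alternative inputs whose effect on a classifier such a study would compare.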
From Word to Sense Embeddings: A Survey on Vector Representations of Meaning
Over the past years, distributed semantic representations have proved to be
effective and flexible keepers of prior knowledge to be integrated into
downstream applications. This survey focuses on the representation of meaning.
We start from the theoretical background behind word vector space models and
highlight one of their major limitations: the meaning conflation deficiency,
which arises from representing a word with all its possible meanings as a
single vector. Then, we explain how this deficiency can be addressed through a
transition from the word level to the more fine-grained level of word senses
(in its broader acceptation) as a method for modelling unambiguous lexical
meaning. We present a comprehensive overview of the wide range of techniques in
the two main branches of sense representation, i.e., unsupervised and
knowledge-based. Finally, this survey covers the main evaluation procedures and
applications for this type of representation, and provides an analysis of four
of its important aspects: interpretability, sense granularity, adaptability to
different domains and compositionality.
Comment: 46 pages, 8 figures. Published in Journal of Artificial Intelligence
Research
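The meaning conflation deficiency described in this abstract can be illustrated with a toy example. This is a sketch under loud assumptions: the two-dimensional vectors are invented for illustration, and modelling the word vector as the mean of its sense vectors is a simplification, not a claim about how any particular embedding model behaves.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy sense vectors (invented): axis 0 ~ "finance", axis 1 ~ "geography".
SENSES = {
    "bank_finance": [1.0, 0.0],
    "bank_river": [0.0, 1.0],
}

# Assumption: a single word vector conflates its senses, modelled here as their mean.
word_bank = [sum(dims) / len(SENSES) for dims in zip(*SENSES.values())]

# Toy vector for a context word related to the financial sense.
money = [1.0, 0.0]

sim_word = cosine(word_bank, money)                  # conflated word vector
sim_sense = cosine(SENSES["bank_finance"], money)    # dedicated sense vector
```

In this toy setting the dedicated sense vector is more similar to `money` than the conflated word vector is, which is the intuition behind moving from the word level to the sense level.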
A Unified multilingual semantic representation of concepts
Semantic representation lies at the core of several applications in Natural Language Processing. However, most existing semantic representation techniques cannot be used effectively for the representation of individual word senses. We put forward a novel multilingual concept representation, called MUFFIN, which not only enables accurate representation of word senses in different languages, but also provides multiple advantages over existing approaches. MUFFIN represents a given concept in a unified semantic space irrespective of the language of interest, enabling cross-lingual comparison of different concepts. We evaluate our approach on two different evaluation benchmarks, semantic similarity and Word Sense Disambiguation, reporting state-of-the-art performance on several standard datasets.
NASARI: a novel approach to a Semantically-Aware Representation of items
The semantic representation of individual word senses and concepts is of fundamental importance to several applications in Natural Language Processing. To date, concept modeling techniques have in the main based their representation either on lexicographic resources, such as WordNet, or on encyclopedic resources, such as Wikipedia. We propose a vector representation technique that combines the complementary knowledge of both these types of resource. Thanks to its use of explicit semantics combined with a novel cluster-based dimensionality reduction and an effective weighting scheme, our representation attains state-of-the-art performance on multiple datasets in two standard benchmarks: word similarity and sense clustering. We are releasing our vector representations at http://lcl.uniroma1.it/nasari/
Robust Hate Speech Detection in Social Media: A Cross-Dataset Empirical Evaluation
The automatic detection of hate speech online is an active research area in
NLP. Most of the studies to date are based on social media datasets that
contribute to the creation of hate speech detection models trained on them.
However, data creation processes contain their own biases, and models
inherently learn from these dataset-specific biases. In this paper, we perform
a large-scale cross-dataset comparison where we fine-tune language models on
different hate speech detection datasets. This analysis shows how some datasets
are more generalisable than others when used as training data. Crucially, our
experiments show how combining hate speech detection datasets can contribute to
the development of robust hate speech detection models. This robustness holds
even when controlling for data size and compared with the best individual
datasets.
Comment: Accepted in "Workshop on Online Abuse and Harms (WOAH)", 202
Semantic vector representations of senses, concepts and entities and their applications in natural language processing
Representation learning lies at the core of Artificial Intelligence (AI) and Natural Language Processing (NLP). Most recent research has focused on developing representations at the word level. In particular, the representation of words in a vector space has been viewed as one of the most important successes of lexical semantics and NLP in recent years. The generalization power and flexibility of these representations have enabled their integration into a wide variety of text-based applications, where they have proved extremely beneficial. However, these representations are hampered by an important limitation, as they are unable to model different meanings of the same word.
In order to deal with this issue, in this thesis we analyze and develop flexible semantic representations of meanings, i.e. senses, concepts and entities. This finer distinction enables us to model semantic information at a deeper level, which in turn is essential for dealing with ambiguity.
In addition, we view these (vector) representations as a connecting bridge between lexical resources and textual data, encoding knowledge from both sources. We argue that these sense-level representations, much like word embeddings, constitute a first step towards seamlessly integrating explicit knowledge into NLP applications, while focusing on the deeper sense level. Their use not only aims at resolving the inherent lexical ambiguity of language, but also represents a first step towards the integration of background knowledge into NLP applications. Multilinguality is another key feature of these representations, as we explore the construction of language-independent and multilingual techniques that can be applied to arbitrary languages, and also across languages.
We propose simple unsupervised and supervised frameworks which make use of these vector representations for word sense disambiguation, a key application in natural language understanding, and other downstream applications such as text categorization and sentiment analysis. Given the nature of the vectors, we also investigate their effectiveness for improving and enriching knowledge bases, by reducing the sense granularity of their sense inventories and extending them with domain labels, hypernyms and collocations.