7,034 research outputs found
Noise or music? Investigating the usefulness of normalisation for robust sentiment analysis on social media data
In the past decade, sentiment analysis research has thrived, especially on social media. While this data genre is suitable to extract opinions and sentiment, it is known to be noisy. Complex normalisation methods have been developed to transform noisy text into its standard form, but their effect on tasks like sentiment analysis remains underinvestigated. Sentiment analysis approaches mostly include spell checking or rule-based normalisation as preprocess- ing and rarely investigate its impact on the task performance. We present an optimised sentiment classifier and investigate to what extent its performance can be enhanced by integrating SMT-based normalisation as preprocessing. Experiments on a test set comprising a variety of user-generated content genres revealed that normalisation improves sentiment classification performance on tweets and blog posts, showing the model’s ability to generalise to other data genres
Text segmentation on multilabel documents: A distant-supervised approach
Segmenting text into semantically coherent segments is an important task with
applications in information retrieval and text summarization. Developing
accurate topical segmentation requires the availability of training data with
ground truth information at the segment level. However, generating such labeled
datasets, especially for applications in which the meaning of the labels is
user-defined, is expensive and time-consuming. In this paper, we develop an
approach that instead of using segment-level ground truth information, it
instead uses the set of labels that are associated with a document and are
easier to obtain as the training data essentially corresponds to a multilabel
dataset. Our method, which can be thought of as an instance of distant
supervision, improves upon the previous approaches by exploiting the fact that
consecutive sentences in a document tend to talk about the same topic, and
hence, probably belong to the same class. Experiments on the text segmentation
task on a variety of datasets show that the segmentation produced by our method
beats the competing approaches on four out of five datasets and performs at par
on the fifth dataset. On the multilabel text classification task, our method
performs at par with the competing approaches, while requiring significantly
less time to estimate than the competing approaches.Comment: Accepted in 2018 IEEE International Conference on Data Mining (ICDM
- …