138,424 research outputs found

    The Role of Text Pre-processing in Sentiment Analysis

    Get PDF
    It is challenging to understand the latest trends and summarise the state or general opinions about products due to the big diversity and size of social media data, and this creates the need of automated and real time opinion extraction and mining. Mining online opinion is a form of sentiment analysis that is treated as a difficult text classification task. In this paper, we explore the role of text pre-processing in sentiment analysis, and report on experimental results that demonstrate that with appropriate feature selection and representation, sentiment analysis accuracies using support vector machines (SVM) in this area may be significantly improved. The level of accuracy achieved is shown to be comparable to the ones achieved in topic categorisation although sentiment analysis is considered to be a much harder problem in the literature

    Topic Classification for Short Texts

    Get PDF
    In the context of TV and social media surveillance, constructing models to automate topic identification of short texts is key task. This paper formalizes the topic classification as a top-K multinomial classification problem and constructs worth-to-consider models for practical usage. We describe the full data processing pipeline, discussing about dataset selection, text preprocessing, feature extraction, model selection and learning, including hyperparameter optimization. When computing time and resources are limited, we show that a classical model like SVM performs as well as an advanced deep neural network, but with shorter model training time

    A Topical Approach to Capturing Customer Insight In Social Media

    Full text link
    The age of social media has opened new opportunities for businesses. This flourishing wealth of information is outside traditional channels and frameworks of classical marketing research, including that of Marketing Mix Modeling (MMM). Textual data, in particular, poses many challenges that data analysis practitioners must tackle. Social media constitute massive, heterogeneous, and noisy document sources. Industrial data acquisition processes include some amount of ETL. However, the variability of noise in the data and the heterogeneity induced by different sources create the need for ad-hoc tools. Put otherwise, customer insight extraction in fully unsupervised, noisy contexts is an arduous task. This research addresses the challenge of fully unsupervised topic extraction in noisy, Big Data contexts. We present three approaches we built on the Variational Autoencoder framework: the Embedded Dirichlet Process, the Embedded Hierarchical Dirichlet Process, and the time-aware Dynamic Embedded Dirichlet Process. These nonparametric approaches concerning topics present the particularity of determining word embeddings and topic embeddings. These embeddings do not require transfer learning, but knowledge transfer remains possible. We test these approaches on benchmark and automotive industry-related datasets from a real-world use case. We show that our models achieve equal to better performance than state-of-the-art methods and that the field of topic modeling would benefit from improved evaluation metrics
    • …
    corecore