    Effect of Text Processing Steps on Twitter Sentiment Classification using Word Embedding

    Processing of raw text is the crucial first step in text classification and sentiment analysis. However, text processing steps are often performed with off-the-shelf routines and pre-built word dictionaries, without optimizing for domain, application, and context. This paper investigates the effect of seven text processing scenarios on a particular text domain (Twitter) and application (sentiment classification). Skip-gram-based word embeddings are developed to include Twitter colloquial words, emojis, and hashtag keywords, which are often removed because they are unavailable in conventional literature corpora. Our experiments reveal negative effects on sentiment classification from two common text processing steps: 1) stop-word removal and 2) averaging of word vectors to represent individual tweets. New, effective steps for 1) including non-ASCII emoji characters, 2) measuring word importance from the word embedding, 3) aggregating word vectors into a tweet embedding, and 4) developing a linearly separable feature space are proposed to optimize the sentiment classification pipeline. The best combination of text processing steps yields the highest average area under the curve (AUC) of 88.4 (+/-0.4) in classifying 14,640 tweets with three sentiment labels. Word selection from the context-driven word embedding reveals that the ten most important words in a tweet cumulatively yield over 98% of the maximum accuracy. The results demonstrate a means of data-driven selection of important words for tweet classification, as opposed to using pre-built word dictionaries. The proposed tweet embedding is robust to, and alleviates the need for, several text processing steps.
    Comment: 14 pages, 3 figures, 7 tables
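The contrast between plain averaging of word vectors (which the paper finds harmful) and importance-weighted aggregation into a tweet embedding can be sketched with a toy example. The vocabulary, random vectors, and weights below are illustrative assumptions, not the paper's trained skip-gram embeddings:

```python
import numpy as np

# Toy embedding table; in the paper, these would be skip-gram vectors
# trained on a tweet corpus (including emojis and hashtag keywords).
rng = np.random.default_rng(0)
vocab = {"great": 0, "flight": 1, "delay": 2, "thanks": 3}
E = rng.normal(size=(len(vocab), 4))  # 4-dimensional toy vectors

def mean_embedding(tokens):
    """Plain averaging of word vectors -- the common step the paper
    reports as detrimental to sentiment classification."""
    vecs = [E[vocab[t]] for t in tokens if t in vocab]
    return np.mean(vecs, axis=0)

def weighted_embedding(tokens, weights):
    """Importance-weighted aggregation into a tweet embedding.
    `weights` maps a word to its importance score (illustrative);
    unknown words default to weight 1.0."""
    kept = [t for t in tokens if t in vocab]
    vecs = np.array([E[vocab[t]] for t in kept])
    w = np.array([weights.get(t, 1.0) for t in kept])
    return (w[:, None] * vecs).sum(axis=0) / w.sum()
```

With uniform weights the two aggregations coincide; the paper's point is that data-driven, non-uniform word importance yields a better tweet representation.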

    Análise, seleção e teste de ferramentas para coleta de dados sobre objetos móveis visando enriquecimento semântico (Analysis, selection, and testing of tools for collecting data about mobile objects for semantic enrichment)

    Undergraduate thesis (TCC) - Universidade Federal de Santa Catarina, Centro Tecnológico, Ciências da Computação.
    The amount of data generated by the widespread use of social media (e.g. Twitter, Facebook) is massive and increases every second. Much of this data is publicly available and can feed a wide variety of applications. However, social media posts have unstructured textual content that is subject to noise and interpretation problems, so such posts need to be semantically enriched before being used in certain applications. The process of semantically enriching social media posts requires information about the context within which the posts are made, in order to generate quality annotations. This work presents an effort to collect and organize data for semantic enrichment experiments on social media posts, a literature review of tools for collecting additional context information that may aid semantic enrichment, and a proposed application for collecting data about mobile objects at the time of Twitter posts.
    The results obtained in this work are: (i) a remodeled and expanded schema to organize the database of mobile objects, their posts, and experiments on the semantic enrichment of such posts at the Laboratory for Integration of Systems and Applications (LISA); (ii) the collection and organization of a large volume of Twitter data, including hundreds of millions of tweets and profile information of the users who posted them; (iii) a comparative analysis of tools for collecting context information about mobile objects, which can complement profile information and assist in the semantic enrichment of social media posts; (iv) a tool for collecting data about the context of mobile objects at the time of Twitter posts; (v) results of preliminary experiments on the semantic annotation of the collected tweets with linked open data.
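A context record that such a collection tool might capture at posting time can be sketched as a simple data structure. The field names and types below are illustrative assumptions, not the schema actually used in the thesis:

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
from typing import Optional

@dataclass
class PostContext:
    """Context of a mobile object at the moment of a tweet.
    All field names are hypothetical, for illustration only."""
    user_id: str
    tweet_id: str
    posted_at: datetime
    latitude: float
    longitude: float
    speed_kmh: Optional[float] = None  # None when no movement data is available

# Example record (coordinates roughly in Florianópolis, for illustration)
ctx = PostContext("u1", "t1", datetime.now(timezone.utc), -27.59, -48.55)
```

Records like this can later be joined with the tweet text to supply the contextual cues that the semantic enrichment step needs.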

    Semantics-Driven Aspect-Based Sentiment Analysis

    People using the Web are constantly invited to share their opinions and preferences with the rest of the world, which has led to an explosion of opinionated blogs, reviews of products and services, and comments on virtually everything. This type of web-based content is increasingly recognized as a source of data with added value for multiple application domains. While the large number of available reviews almost ensures that all relevant aspects of the entity under review are properly covered, manually reading each and every review is not feasible. Aspect-based sentiment analysis aims to solve this issue: it is concerned with developing algorithms that can automatically extract fine-grained sentiment information from a set of reviews, computing a separate sentiment value for each aspect of the product or service being reviewed. This dissertation focuses on which discriminants are useful when performing aspect-based sentiment analysis: what signals for sentiment can be extracted from the text itself, and what is the effect of using extra-textual discriminants? We find that using semantic lexicons or ontologies can greatly improve the quality of aspect-based sentiment analysis, especially with limited training data. Additionally, because semantics drives the analysis, the algorithm is less of a black box and its results are easier to explain.
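The core idea of lexicon-driven, aspect-based sentiment scoring can be sketched in a few lines. The aspect and sentiment lexicons below are tiny illustrative stand-ins, not the semantic lexicons or ontologies used in the dissertation:

```python
# Hypothetical lexicons: aspect -> cue words, word -> polarity.
ASPECTS = {"food": {"pizza", "pasta"}, "service": {"waiter", "staff"}}
SENTIMENT = {"great": 1, "friendly": 1, "slow": -1, "cold": -1}

def aspect_sentiment(tokens, window=2):
    """Score each aspect by summing the polarity of sentiment words
    within `window` tokens of an aspect cue (a deliberately crude
    proximity heuristic, for illustration)."""
    scores = {}
    for i, tok in enumerate(tokens):
        for aspect, cues in ASPECTS.items():
            if tok in cues:
                ctx = tokens[max(0, i - window): i + window + 1]
                s = sum(SENTIMENT.get(w, 0) for w in ctx)
                scores[aspect] = scores.get(aspect, 0) + s
    return scores
```

Because every score traces back to named lexicon entries near a named aspect cue, the output is easy to explain, which is the transparency advantage the abstract highlights.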

    Compressing Labels of Dynamic XML Data using Base-9 Scheme and Fibonacci Encoding

    The flexibility and self-describing nature of XML has made it the most common mark-up language used for data representation over the Web. XML data is naturally modelled as a tree, where the structural tree information can be encoded into labels via XML labelling scheme in order to permit answers to queries without the need to access original XML files. As the transmission of XML data over the Internet has become vibrant, it has also become necessary to have an XML labelling scheme that supports dynamic XML data. For a large-scale and frequently updated XML document, existing dynamic XML labelling schemes still suffer from high growth rates in terms of their label size, which can result in overflow problems and/or ambiguous data/query retrievals. This thesis considers the compression of XML labels. A novel XML labelling scheme, named “Base-9”, has been developed to generate labels that are as compact as possible and yet provide efficient support for queries to both static and dynamic XML data. A Fibonacci prefix-encoding method has been used for the first time to store Base-9’s XML labels in a compressed format, with the intention of minimising the storage space without degrading XML querying performance. The thesis also investigates the compression of XML labels using various existing prefix-encoding methods. This investigation has resulted in the proposal of a novel prefix-encoding method named “Elias-Fibonacci of order 3”, which has achieved the fastest encoding time of all prefix-encoding methods studied in this thesis, whereas Fibonacci encoding was found to require the minimum storage. Unlike current XML labelling schemes, the new Base-9 labelling scheme ensures the generation of short labels even after large, frequent, skewed insertions. 
    The advantages of such short labels, as generated by combining the Base-9 scheme with Fibonacci encoding, for storing, updating, retrieving, and querying XML data are supported by the experimental results reported herein.
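Fibonacci prefix encoding itself is a standard construction and can be sketched briefly (this is the generic Fibonacci/Zeckendorf code, not the thesis's Base-9 labelling scheme): each positive integer is written as a sum of non-consecutive Fibonacci numbers, the corresponding bits are emitted from the smallest Fibonacci number upward, and a final 1 is appended. Since Zeckendorf representations never contain two consecutive 1s, the pair "11" only occurs at the end of a codeword, making the code self-delimiting:

```python
def fibonacci_encode(n: int) -> str:
    """Fibonacci (Zeckendorf) prefix code for a positive integer."""
    fibs = [1, 2]
    while fibs[-1] <= n:
        fibs.append(fibs[-1] + fibs[-2])
    # Greedy Zeckendorf decomposition from the largest usable Fibonacci down.
    bits = [0] * (len(fibs) - 1)
    rem = n
    for i in range(len(bits) - 1, -1, -1):
        if fibs[i] <= rem:
            bits[i] = 1
            rem -= fibs[i]
    return "".join(map(str, bits)) + "1"  # trailing 1 terminates the codeword

def fibonacci_decode(code: str) -> int:
    """Inverse of fibonacci_encode; `code` includes the trailing stop bit."""
    fibs = [1, 2]
    while len(fibs) < len(code):
        fibs.append(fibs[-1] + fibs[-2])
    return sum(f for bit, f in zip(code[:-1], fibs) if bit == "1")
```

For example, 4 = 3 + 1 encodes as "1011" and 6 = 5 + 1 as "10011"; small integers get short codewords, which is why the scheme suits compact label storage.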