Natural Language Processing methods for short informal text

Abstract

The English language is changing faster than at any time before. Social media plays a major role in this change, as it has become an essential part of people's social lives. Thoughts, ideas, feelings, and even special moments make up most of the posts on Twitter and Facebook, which are the most popular social media platforms. In this work, we address this change in language and how it affects traditional Natural Language Processing (NLP) techniques in this specific domain. The domain is challenging for many NLP methods, such as topic modelling, named entity recognition, and sentiment analysis. We produced novel NLP methods that target the informality of short text. Our first novel model performs topic modelling on short messy text. The proposed model was inspired by the relation, over time, between a word's frequency and the frequencies of its context words (the words surrounding the selected word). This relation was translated into co-occurrence patterns, which were transformed into a feature space and stored as word embeddings. The features were generated from the frequencies of words and context words by our novel Term Frequency-Inverse Context Term Frequency (TF-ICTF) algorithm. TF-ICTF was derived from the standard Term Frequency-Inverse Document Frequency (TF-IDF) algorithm, which did not perform well on short messy text. The proposed model is based on word probabilities and the co-occurrences between words within the short text. We therefore named our approach Probabilistic Relational Supervised Topic Modelling. The second approach addresses non-standard entities in short text. We proposed a new model using word pattern embeddings generated from streamed Twitter data. These patterns should include entities that are identified by state-of-the-art named entity recognition (NER) algorithms. We named our approach Probabilistic Named Entity Recognition (PNER).
PNER was trained on the entities identified in the pattern embeddings so that it can identify entities in non-standard formats. Lastly, we proposed our Probabilistic co-occurrence Relational Sentiment (PR_Sentiment) approach to classify tweets by sentiment. It uses sentiment patterns detected in short-text tweets. These patterns are structured as n-grams, which are detected in sentiment-annotated tweets and labelled accordingly. The dataset used is a standard dataset with more than one million annotated tweets. Moreover, the PR_Sentiment model runs in near real time. The aim of our project is to address the informality and non-standardisation of social media short text and to produce novel NLP methods. These methods were designed as a step towards generalising the processing of short messy text. Our methods have therefore been tested and compared against several state-of-the-art approaches to demonstrate their novelty.
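
As an illustration of the baseline mentioned above, the following is a minimal sketch of standard TF-IDF applied to a few tweet-like messages. It is not the TF-ICTF algorithm, whose definition is not given in this abstract; the function name `tf_idf`, the whitespace tokenisation, and the sample messages are assumptions made purely for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenised document (standard TF-IDF)."""
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            # Relative term frequency times log inverse document frequency.
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical short, informal messages (whitespace-tokenised).
tweets = [
    "omg this phone is gr8".split(),
    "this phone is so bad".split(),
    "gr8 day at the beach".split(),
]
weights = tf_idf(tweets)
```

In messages this short, almost every term occurs exactly once, so the term-frequency factor is nearly constant within a message and the weight is driven almost entirely by the inverse document frequency. This is one plausible reason a document-level weighting carries little signal on short messy text.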
