4 research outputs found

    A machine learning approach for Urdu text sentiment analysis

    Product evaluations, ratings, and other forms of online expression have grown in popularity with the emergence of social networking sites and blogs. This rapidly expanding body of data has made sentiment analysis a new area of study for computational linguists. For English, it has been a topic of research for about a decade. However, the scientific community has largely neglected other important languages, such as Urdu. Urdu is morphologically one of the most complex languages in the world, and characteristics such as its unusual morphology and unrestricted word order make Urdu language processing a difficult challenge. This research provides a new framework for the categorization of Urdu language sentiments. The main contributions are to show the importance of this multidimensional research problem and to describe its technical components, such as the parsing algorithm, corpus, and lexicon. The proposed approach to Urdu text sentiment analysis comprises data gathering, pre-processing, feature extraction, feature vector formation, and finally sentiment classification. The results and discussion section compares the proposed work with the standard baseline method in terms of precision, recall, F-measure, and accuracy on three different types of datasets. In the overall comparison of the models, the proposed work shows encouraging results in accuracy and the other metrics. Finally, this section also discusses trends and possible future directions for the current work.
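The four evaluation metrics named in the abstract above can be illustrated with a minimal Python sketch. This is not the authors' evaluation code: the function name `evaluate`, the label strings, and the toy data are assumptions made for illustration, and the sketch covers only the binary case with one designated positive class.

```python
def evaluate(gold, predicted, positive="pos"):
    """Compute precision, recall, F-measure, and accuracy for one positive class.

    `gold` and `predicted` are equal-length lists of label strings.
    The label names are hypothetical; any hashable labels would work.
    """
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    correct = sum(1 for g, p in pairs if g == p)

    # Guard against empty denominators so the function never divides by zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f_measure = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    accuracy = correct / len(pairs) if pairs else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "accuracy": accuracy,
    }


# Toy labels, invented for illustration only.
gold = ["pos", "pos", "neg", "neg", "pos"]
predicted = ["pos", "neg", "neg", "pos", "pos"]
metrics = evaluate(gold, predicted)
```

On the toy labels there are two true positives, one false positive, and one false negative, so precision and recall are both 2/3 and accuracy is 3/5.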

    Discriminative Feature Spamming Technique for Roman Urdu Sentiment Analysis

    No full text

    On Multi-domain Sentence Level Sentiment Analysis for Roman Urdu

    Sentiment analysis, or opinion mining, is a computational process to determine the polarity of a topic, opinion, emotion, or attitude. Most work on sentiment analysis targets resource-rich languages, such as English and Chinese. However, only limited work has been done for Roman Urdu/Hindi, which therefore remains a resource-poor language. Developing a robust sentiment analysis system for Roman Urdu/Hindi is necessary for two major reasons. First, Urdu/Hindi is the third largest spoken language in the world, with over 500 million speakers. Second, Roman Urdu/Hindi is increasingly used because people prefer to communicate on the web using Latin script (the 26 letters of the English alphabet) instead of typing with their language-specific keyboards. Since work on Roman Urdu/Hindi sentiment analysis is still in its infancy, new techniques and improvements to existing techniques are urgently required. In particular, an automated technique for Roman Urdu/Hindi text normalization is necessary, as this problem widely affects the performance of all natural language processing applications, including sentiment classification. The unavailability of an annotated dataset is another major obstacle to building effective techniques for Roman Urdu/Hindi sentiment analysis. In this thesis, challenging issues hindering the development of effective Roman Urdu/Hindi sentiment classification have been addressed. First, the largest-ever dataset of 11,000 Roman Urdu/Hindi reviews has been gathered from six different domains, using comprehensive annotation guidelines. Second, a machine learning-based Roman Urdu sentiment analysis system is developed using different content-based features. Third, a novel feature selection technique, called Discriminative Feature Spamming Technique, has been developed for Roman Urdu/Hindi sentiment analysis. This technique identifies distinctive features based on a term utility criterion and then further increases their discriminative power by spamming them. The spelling variation problem inherent to Roman Urdu/Hindi adversely affects the performance of machine learning algorithms. Therefore, in the next step, the open and hard problem of Roman Urdu/Hindi word normalization has been addressed by developing an automated lexical normalizer. Its encoder maps differing spellings of a single Roman Urdu/Hindi word to a single common code via a transliteration-based technique. This technique will have broad implications for different natural language processing applications. In addition, it will provide a concrete foundation for the research community to develop tools to automatically transliterate Urdu to Roman script and vice versa.
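The idea of mapping differing spellings of one word to a single common code can be illustrated with a toy Python sketch. This is not the thesis's transliteration-based encoder, whose details are not given in the abstract. It swaps in a much simpler rule-based key (lowercase, collapse repeated letters, drop non-initial vowels), and the function names `encode` and `group_variants` and the example words are assumptions made for illustration.

```python
import re
from collections import defaultdict

VOWELS = set("aeiou")


def encode(word):
    """Map a Roman-script word to a crude spelling-insensitive code.

    Assumed toy rules (not the thesis's method):
    1. lowercase and keep only the letters a-z,
    2. collapse runs of the same letter ("achha" -> "acha"),
    3. keep the first letter, then drop all later vowels.
    """
    letters = re.sub(r"[^a-z]", "", word.lower())
    collapsed = re.sub(r"(.)\1+", r"\1", letters)
    if not collapsed:
        return ""
    return collapsed[0] + "".join(c for c in collapsed[1:] if c not in VOWELS)


def group_variants(words):
    """Group words that share the same code, preserving input order."""
    groups = defaultdict(list)
    for word in words:
        groups[encode(word)].append(word)
    return dict(groups)


# Hypothetical spelling variants, chosen for illustration.
variants = group_variants(["acha", "achha", "achaa", "bohat", "bahut"])
# variants == {"ach": ["acha", "achha", "achaa"], "bht": ["bohat", "bahut"]}
```

A key this coarse will also merge genuinely different words that share a consonant skeleton, which is one reason a real normalizer needs a more careful encoding than this sketch.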