372 research outputs found

    Adaptation of Domain-Specific Transformer Models with Text Oversampling for Sentiment Analysis of Social Media Posts on COVID-19 Vaccine

    Covid-19 has spread across the world and many different vaccines have been developed to counter its surge. To identify the correct sentiments associated with the vaccines from social media posts, this paper fine-tunes pre-trained transformer models on tweets associated with different Covid vaccines: RoBERTa, XLNet and BERT, which are recently introduced state-of-the-art bi-directional transformer models, and the domain-specific transformer models BERTweet and CT-BERT, which are pre-trained on Covid-19 tweets. We further explore data augmentation by text oversampling using LMOTE to improve the accuracy of these models, specifically for small-sample datasets with an imbalanced class distribution among the positive, negative and neutral sentiment classes. Our results summarize our findings on the suitability of text oversampling for imbalanced, small-sample datasets used to fine-tune state-of-the-art pre-trained transformer models, and on the utility of domain-specific transformer models for the classification task.

    Adaptation of domain-specific transformer models with text oversampling for sentiment analysis of social media posts on Covid-19 vaccines

    Covid-19 has spread across the world and several vaccines have been developed to counter its surge. To identify the correct sentiments associated with the vaccines from social media posts, we fine-tune various state-of-the-art pre-trained transformer models on tweets associated with Covid-19 vaccines. Specifically, we use the recently introduced pre-trained transformer models RoBERTa, XLNet and BERT, and the domain-specific transformer models CT-BERT and BERTweet, which are pre-trained on Covid-19 tweets. We further explore text augmentation by oversampling using the Language Model based Oversampling Technique (LMOTE) to improve the accuracy of these models, specifically for small-sample datasets with an imbalanced class distribution among the positive, negative and neutral sentiment classes. Our results summarize our findings on the suitability of text oversampling for imbalanced small-sample datasets used to fine-tune state-of-the-art pre-trained transformer models, and on the utility of domain-specific transformer models for the classification task.
    Comment: The paper has been accepted for publication in Computer Science journal: http://journals.agh.edu.pl/csc

    The Majority Report - Can we use big data to secure a better future?

    With its wide adoption, social media has become a common platform for calling on supporters to join civil unrest events. Despite the noble aims of these events, they sometimes turn violent and disturb the daily lives of the general public. This paper proposes a conceptual framework for using online social media data to predict offline civil unrest events. We propose to use time-series metrics as the prediction attributes instead of analyzing message contents, because message contents on social media are usually noisy, informal and not easy to interpret. In a data set containing both civil unrest event dates and normal dates, we found many more samples from the normal dates class than from the civil unrest event dates class, which creates an imbalanced class problem. We show that using accuracy as the performance metric can be misleading because civil unrest events are the minority class. We therefore suggest additional tactics for the imbalanced class prediction problem: a combination of oversampling the minority class and feature selection techniques. The current results demonstrate that using time-series metrics to predict civil unrest events is a possible solution to the problems of handling the noise and unstructured format of social media data in analysis and prediction. In addition, we show that classifiers combined with these imbalance-handling techniques outperformed classifiers without them.
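
The point about accuracy being misleading for a minority class can be illustrated with a small sketch. The class counts below (95 normal dates, 5 unrest dates) and the label names are hypothetical and are not taken from the paper; the "classifier" is a trivial baseline that always predicts the majority class.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive):
    """Fraction of true `positive` samples that were predicted as `positive`."""
    preds_for_positives = [p for t, p in zip(y_true, y_pred) if t == positive]
    if not preds_for_positives:
        return 0.0
    return sum(p == positive for p in preds_for_positives) / len(preds_for_positives)


# Hypothetical imbalanced data: 95 normal dates, 5 civil unrest dates.
y_true = ["normal"] * 95 + ["unrest"] * 5

# Trivial baseline that always predicts the majority class.
y_pred = ["normal"] * len(y_true)

baseline_accuracy = accuracy(y_true, y_pred)
unrest_recall = recall(y_true, y_pred, positive="unrest")

print(f"accuracy = {baseline_accuracy:.2f}, unrest recall = {unrest_recall:.2f}")
```

The baseline scores high on accuracy while detecting none of the minority-class events, which is why a per-class metric such as recall is a more informative check in this setting.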

    Detection of Offensive YouTube Comments, a Performance Comparison of Deep Learning Approaches

    Social media data is open, free and available in massive quantities. However, making sense of this data is significantly limited by its high volume, variety, uncertain veracity, velocity, value and variability. This work provides a comprehensive framework of text processing and analysis performed on YouTube comments with offensive and non-offensive content. YouTube is a platform where people of every age group log in and find the type of content that most appeals to them. At the same time, a massive increase in the use of offensive language has become apparent. Because the volume of new comments is massive, offensive comments cannot each be removed manually, and disabling the comment section would be bad for business for YouTubers, who would no longer receive any feedback.

    A Hybrid Deep Learning Approach for Sentiment Analysis in Product Reviews

    Product reviews play a crucial role in providing valuable insights to consumers and producers. Analyzing the vast amount of data generated around a product, such as posts, comments, and views, can be challenging for business intelligence purposes. Sentiment analysis of this content helps both consumers and producers gain a better understanding of the market status, enabling them to make informed decisions. In this study, we propose a novel hybrid approach based on deep neural networks (DNNs) for sentiment analysis in product reviews, focusing on the classification of sentiments expressed. Our approach utilizes the recursive neural network (RNN) algorithm for sentiment classification. To address the imbalanced distribution of positive and negative samples in social network data, we employ a resampling technique that balances the dataset by increasing samples from the minority class and decreasing samples from the majority class. We evaluate our approach using Amazon data comprising four product categories: clothing, cars, luxury goods, and household appliances. Experimental results demonstrate that our proposed approach performs well in sentiment analysis for product reviews, particularly in the context of digital marketing. Furthermore, the attention-based RNN algorithm outperforms the baseline RNN by approximately 5%. Notably, the study reveals consumer sentiment variations across different products, particularly in relation to appearance and price aspects.
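
The resampling idea described in this abstract (growing the minority class and shrinking the majority class) can be sketched as follows. This is a minimal illustration using random resampling toward a common per-class size; the function name `resample_to_target`, the target size, and the toy reviews are assumptions made here, and the abstract does not say which resampling procedure the authors actually used.

```python
import random
from collections import defaultdict


def resample_to_target(samples, labels, target, seed=0):
    """Randomly resample every class to exactly `target` items.

    Classes larger than `target` are undersampled without replacement;
    smaller classes keep all their items and are padded with random
    duplicates (oversampling with replacement).
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)

    out_samples, out_labels = [], []
    for label, items in by_class.items():
        if len(items) >= target:
            chosen = rng.sample(items, target)
        else:
            chosen = items + rng.choices(items, k=target - len(items))
        out_samples.extend(chosen)
        out_labels.extend([label] * target)
    return out_samples, out_labels


# Hypothetical toy data: 8 positive reviews, 2 negative reviews.
reviews = [f"pos review {i}" for i in range(8)] + [f"neg review {i}" for i in range(2)]
labels = ["positive"] * 8 + ["negative"] * 2

balanced_reviews, balanced_labels = resample_to_target(reviews, labels, target=5)
print(balanced_labels.count("positive"), balanced_labels.count("negative"))
```

Resampling would normally be applied to the training split only, so that duplicated minority samples do not leak into the evaluation data.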

    Class-Decomposition and Augmentation for Imbalanced Data Sentiment Analysis

    Significant progress has been made in the area of text classification and natural language processing. However, like many other datasets from across different domains, text-based datasets may suffer from class imbalance. This problem leads to a model's bias toward the majority class instances. In this paper, we present a new approach to handling class imbalance in text data by means of unsupervised learning algorithms. We present class-decomposition using two different unsupervised methods, namely k-means and Density-Based Spatial Clustering of Applications with Noise, applied to two different sentiment analysis data sets. The experimental results show that utilizing clustering to find within-class similarities can lead to significant improvement in the learning algorithms' performance, as well as reducing the dominance of the majority class instances without causing information loss.
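
Class-decomposition with k-means can be sketched roughly as below: the majority class is clustered and each cluster becomes its own sub-class, so no samples are discarded. This is only an illustrative reading of the abstract. It assumes that the majority class is the one being decomposed, that texts have already been turned into numeric vectors, and it uses a plain k-means seeded with the first k points for determinism; the helper names and the 2-D toy points are hypothetical.

```python
import math


def kmeans(points, k, iterations=20):
    """Plain Lloyd's k-means; the first k points seed the centroids."""
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assign each point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignment


def decompose_majority(points, labels, majority, k):
    """Relabel majority-class samples as k cluster-based sub-classes."""
    idx = [i for i, label in enumerate(labels) if label == majority]
    clusters = kmeans([points[i] for i in idx], k)
    new_labels = list(labels)
    for i, c in zip(idx, clusters):
        new_labels[i] = f"{majority}_{c}"
    return new_labels


# Hypothetical feature vectors: six "neutral" (majority) and two "negative" samples.
points = [(0, 0), (10, 10), (0, 1), (10, 11), (1, 0), (11, 10), (5, 5), (5, 6)]
labels = ["neutral"] * 6 + ["negative"] * 2

new_labels = decompose_majority(points, labels, majority="neutral", k=2)
print(new_labels)
```

A classifier trained on the decomposed labels sees three smaller classes instead of one dominant class; its sub-class predictions would then be mapped back to the original label at prediction time.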

    Irony Detection in Twitter with Imbalanced Class Distributions

    Irony detection is a nontrivial problem, and solving it can help improve natural language processing tasks such as sentiment analysis. When dealing with social media data in real scenarios, an important issue to address is data skew, i.e. the imbalance between the available ironic and non-ironic samples. In this work, the main objective is to address irony detection in Twitter considering various degrees of imbalanced distribution between classes. We rely on the emotIDM irony detection model. We evaluated it against both benchmark corpora and skewed Twitter datasets collected to simulate a realistic distribution of ironic tweets. We carry out a set of classification experiments aimed at determining the impact of class imbalance on detecting irony, and we evaluate the performance of irony detection under different scenarios. We experiment with a set of classifiers, applying class imbalance techniques to compensate for the class distribution. Our results indicate that by using such techniques, it is possible to improve the performance of irony detection in imbalanced class scenarios.
    The first author was funded by CONACYT project FC-2016/2410. Ronaldo Prati was supported by the São Paulo State (Brazil) research council FAPESP under project 2015/20606-6. Francisco Herrera was partially supported by the Spanish National Research Project TIN2017-89517-P. The work of Paolo Rosso was partially supported by the Spanish MICINN under the research project MISMIS (PGC2018-096212-B-C31) and by the Generalitat Valenciana under the grant PROMETEO/2019/121.
    Hernandez-Farias, DI.; Prati, R.; Herrera, F.; Rosso, P. (2020). Irony Detection in Twitter with Imbalanced Class Distributions. Journal of Intelligent & Fuzzy Systems. 39(2):2147-2163. https://doi.org/10.3233/JIFS-179880

    Relationship Between Personality Patterns and Harmfulness: Analysis and Prediction Based on Sentence Embedding

    This paper hypothesizes that harmful utterances need to be judged in the context of whole sentences, and the authors extract features of harmful expressions using a general-purpose language model. Based on the extracted features, the authors propose a method to predict the presence or absence of harmful categories. In addition, the authors believe that it is possible to analyze users who incite others by combining this method with research on analyzing the personality of the speaker from statements on social networking sites. The results confirmed that the proposed method can judge the possibility of harmful comments with higher accuracy than simple dictionary-based models or models using a distributed representation of words. The relationship between personality patterns and harmful expressions was also confirmed by an analysis based on the harmfulness judgment model.

    A performance comparison of oversampling methods for data generation in imbalanced learning tasks

    Dissertation presented as the partial requirement for obtaining a Master's degree in Statistics and Information Management, specialization in Marketing Research and CRM.
    The class imbalance problem is one of the most fundamental challenges faced by the machine learning community. The imbalance refers to the number of instances in the class of interest being relatively low compared to the rest of the data. Sampling is a common technique for dealing with this problem, and a number of oversampling approaches have been applied in an attempt to balance the classes. This study provides an overview of the issue of class imbalance and examines some common oversampling approaches for dealing with it. To illustrate the differences, an experiment is conducted using multiple simulated data sets to compare the performance of these oversampling methods on different classifiers based on various evaluation criteria. In addition, the effect of different parameters, such as the number of features and the imbalance ratio, on classifier performance is also evaluated.
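
As an example of the kind of oversampling method such a comparison might include, the sketch below generates synthetic minority points by interpolating between a minority sample and one of its nearest minority neighbours, in the style of SMOTE. The abstract does not name the methods compared, so SMOTE is used here only as a well-known illustration; the function name `smote_like`, its parameters, and the toy points are assumptions.

```python
import math
import random


def smote_like(minority, n_new, k=2, seed=0):
    """Create `n_new` synthetic points by SMOTE-style interpolation.

    Each synthetic point lies on the line segment between a randomly
    chosen minority point and one of its k nearest minority neighbours.
    Requires at least two distinct minority points.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of `base`, excluding `base` itself.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(p, base),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Hypothetical minority-class samples with two numeric features.
minority = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (2.0, 2.0)]
new_points = smote_like(minority, n_new=6)
print(new_points)
```

Because every synthetic point is a convex combination of two existing minority points, it stays inside the region already occupied by the minority class; this contrasts with random oversampling, which only duplicates existing samples.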