
    Analyzing the Impact of Resampling Method for Imbalanced Data Text in Indonesian Scientific Articles Categorization

    Extremely skewed data in artificial intelligence, machine learning, and data mining applications often produce misleading results, because machine learning algorithms are designed to work best with balanced data. In real situations, however, imbalanced data are common. The most popular technique for handling imbalanced data is resampling the dataset, modifying the number of instances in the majority and minority classes to obtain a balanced distribution. Many resampling techniques, based on oversampling, undersampling, or a combination of both, have been proposed and continue to be developed. Resampling may increase or decrease classifier performance. Comparative research on resampling methods for structured data has been widely carried out, but studies that compare resampling methods on unstructured data are rare. This raises the question of whether these methods are effective on unstructured data such as text, which has high dimensionality and very diverse characteristics. To understand how different resampling techniques affect the learning of classifiers on imbalanced text data, we perform an experimental analysis using various resampling methods with several classification algorithms to classify articles in the Indonesian Scientific Journal Database (ISJD). The experiment shows that resampling techniques on imbalanced text data generally improve classifier performance, but the gains are not significant because the text data are highly diverse and high-dimensional.
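    A minimal sketch of the kind of comparison the abstract describes, assuming scikit-learn and imbalanced-learn; the classifier, sampler set, split, and parameter values are illustrative assumptions, not the paper's actual setup:

        # Illustrative sketch (not the paper's code): comparing resampling strategies
        # on TF-IDF text features with one fixed classifier.
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.model_selection import train_test_split
        from sklearn.naive_bayes import MultinomialNB
        from sklearn.metrics import f1_score
        from imblearn.over_sampling import RandomOverSampler, SMOTE
        from imblearn.under_sampling import RandomUnderSampler

        def evaluate_resamplers(texts, labels):
            """Train the same classifier with different resamplers and report macro F1."""
            X_train_txt, X_test_txt, y_train, y_test = train_test_split(
                texts, labels, test_size=0.2, stratify=labels, random_state=42)
            vectorizer = TfidfVectorizer(max_features=20000)
            X_train = vectorizer.fit_transform(X_train_txt)
            X_test = vectorizer.transform(X_test_txt)

            resamplers = {
                "none": None,
                "random_over": RandomOverSampler(random_state=42),
                "random_under": RandomUnderSampler(random_state=42),
                "smote": SMOTE(random_state=42),   # needs enough minority samples per class
            }
            for name, sampler in resamplers.items():
                X_res, y_res = (X_train, y_train) if sampler is None \
                    else sampler.fit_resample(X_train, y_train)
                clf = MultinomialNB().fit(X_res, y_res)
                print(name, f1_score(y_test, clf.predict(X_test), average="macro"))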

    Class-wise Calibration: A Case Study on COVID-19 Hate Speech

    Proper calibration of deep-learning models is critical for many high-stakes problems. In this paper, we show that existing calibration metrics fail to pay attention to miscalibration on individual classes, hence overlooking minority classes and causing significant issues in imbalanced classification problems. Using a COVID-19 hate-speech dataset, we first discover that in imbalanced datasets, miscalibration error varies greatly across individual classes, and error on minority classes can be many times worse than what the overall calibration performance suggests. To address this issue, we propose a new metric based on expected miscalibration error, named Contraharmonic Expected Calibration Error (CECE), which punishes severe miscalibration on individual classes. We further devise a novel variant of temperature scaling for imbalanced data that improves class-wise calibration: it re-weights the loss function by the inverse class count when tuning the scaling parameter, reducing worst-case minority calibration error. Our case study on a benchmark COVID-19 hate-speech task shows the effectiveness of our calibration metric and our temperature scaling strategy.
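    A minimal sketch of one plausible reading of the metric, assuming it aggregates per-class expected calibration errors with a contraharmonic mean (an assumption based only on the name; the paper's exact definition may differ), so that large per-class errors dominate the aggregate:

        import numpy as np

        def per_class_ece(probs, labels, cls, n_bins=10):
            """Expected calibration error restricted to samples whose true label is `cls`.
            probs: (N, C) predicted probabilities; labels: (N,) integer labels."""
            mask = labels == cls
            n = mask.sum()
            if n == 0:
                return 0.0
            conf = probs[mask].max(axis=1)
            correct = (probs[mask].argmax(axis=1) == labels[mask]).astype(float)
            ece = 0.0
            bins = np.linspace(0.0, 1.0, n_bins + 1)
            for lo, hi in zip(bins[:-1], bins[1:]):
                in_bin = (conf > lo) & (conf <= hi)
                if in_bin.any():
                    ece += in_bin.sum() / n * abs(correct[in_bin].mean() - conf[in_bin].mean())
            return ece

        def contraharmonic_ece(probs, labels, n_bins=10):
            """Contraharmonic mean of per-class ECEs: sum(e_k^2) / sum(e_k).
            Severe miscalibration on any single class is punished more than by a plain average."""
            errs = np.array([per_class_ece(probs, labels, c, n_bins)
                             for c in np.unique(labels)])
            total = errs.sum()
            return 0.0 if total == 0 else float((errs ** 2).sum() / total)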

    IEEE J Biomed Health Inform

    Recent applications of deep learning have shown promising results for classifying unstructured text in the healthcare domain. However, the reliability of models in production settings has been hindered by imbalanced data sets in which a small subset of the classes dominate. In the absence of adequate training data, rare classes necessitate additional model constraints for robust performance. Here, we present a strategy for incorporating short sequences of text (i.e. keywords) into training to boost model accuracy on rare classes. In our approach, we assemble a set of keywords, including short phrases, associated with each class. The keywords are then used as additional data during each batch of model training, resulting in a training loss that has contributions from both raw data and keywords. We evaluate our approach on classification of cancer pathology reports, which shows a substantial increase in model performance for rare classes. Furthermore, we analyze the impact of keywords on model output probabilities for bigrams, providing a straightforward method to identify model difficulties for limited training data.
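    A minimal sketch of the described training scheme, assuming a PyTorch-style classifier; the combination weight and the way keyword phrases are batched are illustrative assumptions, not taken from the paper:

        import torch
        import torch.nn.functional as F

        def train_step(model, optimizer, batch, keyword_batch, kw_weight=0.5):
            """One training step whose loss has contributions from both raw documents
            and class-associated keyword phrases, as described in the abstract.
            batch / keyword_batch: (input token tensor, label tensor) pairs.
            kw_weight is an assumed hyperparameter balancing the two terms."""
            model.train()
            optimizer.zero_grad()

            doc_inputs, doc_labels = batch
            kw_inputs, kw_labels = keyword_batch

            doc_loss = F.cross_entropy(model(doc_inputs), doc_labels)
            kw_loss = F.cross_entropy(model(kw_inputs), kw_labels)

            loss = doc_loss + kw_weight * kw_loss   # combined objective
            loss.backward()
            optimizer.step()
            return loss.item()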

    Emotion Analysis on Twitter Social Media Using the Multinomial Naive Bayes Method and the Synthetic Minority Oversampling Technique

    Twitter is often used to express emotions through tweets, and emotion analysis on Twitter has been studied extensively. Machine learning is widely used to categorize emotions, but an imbalance in the amount of data between classes is a recurring problem. This research therefore evaluates the performance of combining Multinomial Naïve Bayes (MNB) with the Synthetic Minority Oversampling Technique (SMOTE) for emotion analysis of tweets. Each tweet undergoes preprocessing consisting of case folding, data cleaning, slang-word conversion, negation conversion, tokenization, stopword removal, and stemming. Features are extracted with n-grams and weighted by term frequency, and testing is carried out with k-fold cross-validation. With SMOTE, the average accuracy is 0.65 (65%) and the average F1-score is 0.66 (66%); without SMOTE, the average accuracy is 0.64 (64%) and the average F1-score is 0.65 (65%). Although SMOTE gives a 1% improvement in emotion categorization, the results are not yet optimal, and other data-balancing and machine learning methods still need to be studied.
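    A minimal sketch of the described pipeline, assuming scikit-learn and imbalanced-learn; the n-gram range, fold count, and scoring choices are illustrative assumptions, and the Indonesian-specific preprocessing (slang-word conversion, negation handling, stemming) is omitted:

        from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
        from sklearn.naive_bayes import MultinomialNB
        from sklearn.model_selection import cross_validate, StratifiedKFold
        from imblearn.over_sampling import SMOTE
        from imblearn.pipeline import Pipeline  # imblearn pipeline so SMOTE runs only on training folds

        def evaluate_mnb_smote(tweets, labels, use_smote=True):
            """k-fold evaluation of MNB on term-frequency-weighted n-gram features,
            optionally with SMOTE applied inside each training fold."""
            steps = [
                ("counts", CountVectorizer(ngram_range=(1, 2))),   # unigrams + bigrams
                ("tf", TfidfTransformer(use_idf=False)),           # term-frequency weighting only
            ]
            if use_smote:
                steps.append(("smote", SMOTE(random_state=42)))
            steps.append(("mnb", MultinomialNB()))

            cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
            scores = cross_validate(Pipeline(steps), tweets, labels, cv=cv,
                                    scoring=["accuracy", "f1_macro"])
            return {k: v.mean() for k, v in scores.items() if k.startswith("test_")}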

    J Biomed Inform

    In the last decade, the widespread adoption of electronic health record documentation has created huge opportunities for information mining. Natural language processing (NLP) techniques using machine and deep learning are becoming increasingly widespread for information extraction tasks from unstructured clinical notes. Disparities in performance when deploying machine learning models in the real world have recently received considerable attention. In the clinical NLP domain, the robustness of convolutional neural networks (CNNs) for classifying cancer pathology reports under natural distribution shifts remains understudied. In this research, we aim to quantify and improve the performance of the CNN for text classification on out-of-distribution (OOD) datasets resulting from the natural evolution of clinical text in pathology reports. We identified class imbalance due to different prevalence of cancer types as one of the sources of performance drop and analyzed the impact of previous methods for addressing class imbalance when deploying models in real-world domains. Our results show that our novel class-specialized ensemble technique outperforms other methods for the classification of rare cancer types in terms of macro F1 scores. We also found that traditional ensemble methods perform better in top classes, leading to higher micro F1 scores. Based on our findings, we formulate a series of recommendations for other ML practitioners on how to build robust models with extremely imbalanced datasets in biomedical NLP applications.
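    As a brief illustration of why the abstract distinguishes macro from micro F1 (macro averages per-class scores, so rare classes count as much as frequent ones, while micro aggregates all decisions and is dominated by frequent classes), a small sketch with made-up labels, not data from the paper:

        from sklearn.metrics import f1_score

        # Toy labels (not from the paper): class 0 is common, class 1 is rare.
        y_true = [0] * 95 + [1] * 5
        y_pred = [0] * 95 + [0] * 5          # classifier ignores the rare class entirely

        print(f1_score(y_true, y_pred, average="micro"))  # 0.95  -- looks strong
        print(f1_score(y_true, y_pred, average="macro"))  # ~0.49 -- exposes the rare-class failure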

    Harnessing generative AI for overcoming labeled data challenges in social media NLP

    With the introduction of Transformers and Large Language Models, the field of NLP has significantly evolved. Generative AI, a prominent transformer-based technology for crafting human-like content, has demonstrated powerful capabilities across numerous NLP tasks. Simultaneously, social media has emerged as a rich source for NLP explorations, offering vast and diverse datasets that capture real-time language usage, making it a valuable resource for understanding and advancing NLP techniques. Given that supervised learning is the most popular machine learning training method, numerous NLP studies necessitate labor-intensive annotation of social media text. However, despite the large amount of data available, annotating social media data is usually difficult for human experts due to unique characteristics of the text, such as shortness, lack of context, embedded socio-cultural perspectives, and varied writing styles. The challenges in constructing labeled social media datasets often result in a scarcity of labeled data and the generation of low-quality labels. Moreover, these datasets frequently face class imbalance due to the limited number of labeled samples. Hence, ensuring a balanced, high-quality dataset in sufficient quantities is crucial for the robust and accurate development of NLP models. To address these challenges, this study investigates the use of generative AI for generating labeled social media text. Specifically, it focuses on two key objectives: augmenting existing labeled text samples and annotating unlabeled text samples using generative AI. As the generative AI technology, the Generative Pre-trained Transformer (GPT) model, a prevalent choice for AI-based content generation, is employed in different versions throughout the study, and its performance is evaluated against traditional text augmentation and annotation methods. While both studies center on multi-class classification problems, the text augmentation approach addresses human wellness dimensions using Reddit posts, and the text annotation approach tackles stance detection on abortion legalization using Twitter posts. By employing various classifiers, the subsequent investigations aim to enhance classification performance in social media NLP, emphasizing the common goal of expanding labeled datasets while improving label quality.
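    A minimal sketch of the augmentation idea described above, assuming the OpenAI Python client; the model name, prompt wording, and helper function are illustrative assumptions, not the study's actual setup:

        from openai import OpenAI

        client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

        def augment_labeled_post(post_text, label, n_variants=3, model="gpt-4o-mini"):
            """Ask a GPT model to paraphrase a labeled social media post, producing
            additional synthetic samples that keep the original label."""
            prompt = (
                f"Paraphrase the following social media post {n_variants} times, "
                f"one paraphrase per line, preserving its meaning and its '{label}' label:\n\n"
                f"{post_text}"
            )
            response = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
            )
            variants = response.choices[0].message.content.strip().splitlines()
            return [(v.strip(), label) for v in variants if v.strip()]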

    Malicious Interlocutor Detection Using Forensic Analysis of Historic Data

    The on-going problem of child grooming online grows year on year, and whilst government legislation looks to combat the issue by levying heavier penalties on perpetrators of online grooming, crime figures still increase. Government guidance directed towards digital platforms and social media providers places emphasis on child safety online. As this research shows, government initiatives have proved somewhat ineffective. Therefore, the aim of this research is to investigate the scale of the problem and test a variety of machine learning and deep learning techniques that could be used in a novel intelligent solution to protect children from online predation. The heterogeneity of online platforms means that a one-size-fits-all solution presents a complex problem that needs to be solved. The maturity of intelligent approaches to Natural Language Processing makes it possible to analyse and process text data in a wide variety of ways. Pre-processing enables the preparation of text data in a format that machines can understand and reason about without the need for human interaction. The on-going development of Machine Learning and Deep Learning architectures enables the construction of intelligent solutions that can classify text data in new ways. This thesis presents research that tests the application of potential intelligent solutions such as Artificial Neural Networks and Machine Learning algorithms applied in Natural Language Processing. The research also tests the performance of pre-processing workflows and the impact of pre-processing on both online grooming and more general chat corpora. The storage and processing of data via a traditional relational database management system has also been tested for suitability when looking to detect grooming conversation in historical data. Document similarity measures such as Cosine Similarity and Support Vector Machines have displayed positive results in identifying grooming conversation; however, a more intelligent approach may prove more effective for developing a smart autonomous solution, given the ever-evolving lexicon used by participants in online chat conversations.
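    A minimal sketch of the document-similarity approach mentioned above (TF-IDF vectors compared with cosine similarity against known grooming transcripts); the threshold value and function name are illustrative assumptions, not the thesis's actual pipeline:

        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.metrics.pairwise import cosine_similarity

        def flag_suspicious_conversations(known_grooming_texts, new_conversations, threshold=0.3):
            """Flag conversations whose TF-IDF vector is close (cosine similarity) to any
            known grooming transcript. The 0.3 threshold is an arbitrary illustration."""
            vectorizer = TfidfVectorizer(stop_words="english")
            reference = vectorizer.fit_transform(known_grooming_texts)
            candidates = vectorizer.transform(new_conversations)
            sims = cosine_similarity(candidates, reference)      # shape (n_new, n_known)
            return [conv for conv, row in zip(new_conversations, sims) if row.max() >= threshold]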