48 research outputs found

    Improving imbalanced classification by anomaly detection

    Although the anomaly detection problem can be considered an extreme case of the class imbalance problem, very few studies consider improving imbalanced classification with anomaly detection ideas. Most data-level approaches in the imbalanced learning domain aim to introduce more information to the original dataset by generating synthetic samples. In this paper, we gain additional information in another way: by introducing additional attributes. We propose introducing the outlier score and four sample types (safe, borderline, rare, outlier) as additional attributes, in order to capture more of the data characteristics and improve classification performance. According to our experimental results, introducing additional attributes improves imbalanced classification performance in most cases (6 out of 7 datasets). Further study shows that this improvement stems mainly from more accurate classification in the region where the majority and minority classes overlap. The proposed idea of introducing additional attributes is simple to implement and can be combined with resampling techniques and other algorithm-level approaches in the imbalanced learning domain.
    Funding: Horizon 2020 (H2020), Algorithms and the Foundations of Software technology
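    The two proposed attributes can be sketched as follows. This is a minimal illustration, not the authors' exact implementation: the outlier score is assumed to be a k-NN distance score, and the safe/borderline/rare/outlier labels follow the common Napierala and Stefanowski categorisation (counts of same-class neighbours among the k = 5 nearest), which the paper's four sample types appear to match.

    ```python
    import numpy as np

    def knn_outlier_score(X, k=5):
        # Pairwise Euclidean distances; a point's own distance is excluded.
        d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
        np.fill_diagonal(d, np.inf)
        # Outlier score: distance to the k-th nearest neighbour.
        return np.sort(d, axis=1)[:, k - 1]

    def sample_type(X, y, k=5):
        # Label each sample by how many of its k nearest neighbours
        # share its class (Napierala/Stefanowski-style thresholds).
        d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
        np.fill_diagonal(d, np.inf)
        nn = np.argsort(d, axis=1)[:, :k]
        same = (y[nn] == y[:, None]).sum(axis=1)
        types = np.empty(len(y), dtype=object)
        types[same >= 4] = "safe"
        types[(same == 2) | (same == 3)] = "borderline"
        types[same == 1] = "rare"
        types[same == 0] = "outlier"
        return types

    def augment(X, y, k=5):
        # Append the outlier score and an encoded sample type as two
        # extra attribute columns, as the paper proposes.
        score = knn_outlier_score(X, k)[:, None]
        codes = {"safe": 0, "borderline": 1, "rare": 2, "outlier": 3}
        t = np.array([codes[s] for s in sample_type(X, y, k)], dtype=float)[:, None]
        return np.hstack([X, score, t])
    ```

    Any classifier can then be trained on the augmented matrix in place of the original one, and resampling can still be applied on top.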

    MTDOT: A Multilingual Translation-Based Data Augmentation Technique for Offensive Content Identification in Tamil Text Data

    No full text
    The posting of offensive content in regional languages has increased as a result of the accessibility of low-cost internet and the widespread use of online social media. Despite the large number of comments available online, only a small percentage of them are offensive, resulting in an unequal distribution of offensive and non-offensive comments. Due to this class imbalance, classifiers may be biased toward the class with the most samples, i.e., the non-offensive class. To address class imbalance, a Multilingual Translation-based Data augmentation technique for Offensive content identification in Tamil text data (MTDOT) is proposed in this work. The proposed MTDOT method is applied to HASOC’21, a Tamil offensive content dataset. To obtain a balanced dataset, each offensive comment is augmented using multi-level back translation with English and Malayalam as intermediate languages. Another balanced dataset is generated by employing single-level back translation with Malayalam, Kannada, and Telugu as intermediate languages. While both approaches are equally effective, the proposed multi-level back-translation approach produces more diverse data, as evidenced by the BLEU score. The MTDOT technique achieved a promising 65% improvement in F1-score over the widely used SMOTE class balancing method.
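    The two augmentation schemes can be sketched as below. The `translate` function is a placeholder for whatever machine-translation system is used (the abstract does not name one), so it is stubbed out here; the pivot-language chains follow the abstract.

    ```python
    def translate(text, src, tgt):
        # Stub: a real implementation would call an MT service here.
        # Kept as identity so the pipeline shape can be demonstrated.
        return text

    def multi_level_back_translate(comment, pivots=("en", "ml"), src="ta"):
        # Chain the comment through each pivot language in turn,
        # then translate back to the source language (Tamil).
        chain = [src] + list(pivots)
        text = comment
        for a, b in zip(chain, chain[1:]):
            text = translate(text, a, b)
        return translate(text, chain[-1], src)

    def single_level_back_translate(comment, pivots=("ml", "kn", "te"), src="ta"):
        # One augmented copy per pivot: source -> pivot -> source.
        return [translate(translate(comment, src, p), p, src) for p in pivots]
    ```

    With a real MT backend, each offensive comment yields one multi-level variant or several single-level variants, which are added until the classes are balanced.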

    Genome-Wide Association Study of Hepatitis in Korean Populations

    No full text

    GHOST: Adjusting the Decision Threshold to Handle Imbalanced Data in Machine Learning

    No full text
    Machine learning classifiers trained on class-imbalanced data are prone to overpredicting the majority class. This leads to a larger misclassification rate for the minority class, which in many real-world applications is the class of interest. For binary data, the classification threshold defaults to 0.5, which is often not ideal for imbalanced data. Adjusting the decision threshold is a good strategy for dealing with the class imbalance problem. In this work, we present two different automated procedures for selecting the optimal decision threshold for imbalanced classification. A major advantage of our procedures is that they require neither retraining of the machine learning models nor resampling of the training data. The first approach is specific to random forest (RF), while the second approach, named GHOST, can potentially be applied to any machine learning classifier. We tested these procedures on 138 public drug discovery data sets containing structure–activity data for a variety of pharmaceutical targets. We show that both thresholding methods significantly improve the performance of RF. We tested GHOST with four different classifiers in combination with two molecular descriptors, and we found that most classifiers benefit from threshold optimization. GHOST also outperformed other strategies, including random undersampling and conformal prediction. Finally, we show that our thresholding procedures can be effectively applied to real-world drug discovery projects, where the imbalance and characteristics of the data vary greatly between the training and test sets.
    ISSN: 1549-9596; ISSN: 0095-2338; ISSN: 1520-514
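    The core idea of threshold adjustment can be sketched as follows: scan a grid of candidate thresholds over already-computed predicted probabilities and keep the one that maximises a metric suited to imbalance. Cohen's kappa is used here as the metric, and the grid and data names are illustrative, not the GHOST paper's exact procedure. Note that no retraining is involved: only model scores are needed.

    ```python
    import numpy as np

    def cohen_kappa(y_true, y_pred):
        # Cohen's kappa for binary labels in {0, 1}.
        po = np.mean(y_true == y_pred)                  # observed agreement
        p1t, p1p = np.mean(y_true), np.mean(y_pred)
        pe = p1t * p1p + (1 - p1t) * (1 - p1p)          # chance agreement
        return (po - pe) / (1 - pe) if pe < 1 else 0.0

    def optimize_threshold(y_true, proba, grid=None):
        # Pick the decision threshold that maximises kappa, given the
        # true labels and the classifier's predicted probabilities.
        if grid is None:
            grid = np.arange(0.05, 0.95, 0.05)
        kappas = [cohen_kappa(y_true, (proba >= t).astype(int)) for t in grid]
        return grid[int(np.argmax(kappas))]
    ```

    On imbalanced data the selected threshold typically falls below the default 0.5, letting more minority-class predictions through without touching the trained model.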