A Comparison of Classical Versus Deep Learning Techniques for Abusive Content Detection on Social Media Sites
The automated detection of abusive content on social media websites faces a variety of challenges, including imbalanced training sets, the identification of an appropriate feature representation, and the selection of optimal classifiers. Classifiers such as support vector machines (SVMs), combined with bag-of-words or n-gram feature representations, have dominated text classification for decades. With the recent emergence of deep learning and word embeddings, an increasing number of researchers have started to focus on deep neural networks. In this paper, our aim is to explore cutting-edge techniques in automated abusive content detection. We use two deep learning approaches: convolutional neural networks (CNNs) and recurrent neural networks (RNNs). We apply these to 9 public datasets derived from various social media websites. Firstly, we show that word embeddings pre-trained on the same data source as the subsequent classification task improve the prediction accuracy of deep learning models. Secondly, we investigate the impact of different levels of training set imbalance on classifier types. We find that, although deep learning models can outperform the traditional SVM classifier when the training dataset is seriously imbalanced, the performance of the SVM classifier can be dramatically improved through oversampling, to the point of surpassing the deep learning models. Our work can inform researchers in selecting appropriate text classification strategies for the detection of abusive content, including scenarios where the training datasets suffer from class imbalance.
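The oversampling step mentioned in this abstract can be illustrated with a minimal sketch. The abstract does not say which oversampling variant was used, so the code below assumes plain random oversampling (duplicating minority-class examples until the classes are balanced); the function name `random_oversample` and the toy data are hypothetical.

```python
import random
from collections import Counter


def random_oversample(texts, labels, seed=0):
    """Duplicate randomly chosen minority-class examples until every class
    has as many examples as the largest class.

    Assumption: plain random oversampling; the source abstract does not
    specify the variant it used.
    """
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())

    out_texts, out_labels = list(texts), list(labels)
    for cls, n in counts.items():
        # All original examples belonging to this class.
        pool = [t for t, y in zip(texts, labels) if y == cls]
        # Draw the missing number of examples with replacement.
        extra = rng.choices(pool, k=target - n)
        out_texts.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_texts, out_labels


# Toy training split: four neutral posts, one abusive post.
train_texts = ["hello", "nice day", "thanks", "see you", "you idiot"]
train_labels = ["neutral", "neutral", "neutral", "neutral", "abusive"]
balanced_texts, balanced_labels = random_oversample(train_texts, train_labels)
```

Oversampling of this kind is normally applied to the training split only, so that duplicated examples do not leak into the evaluation data.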
Pre-text Representation Transfer for Deep Learning with Limited Imbalanced Data : Application to CT-based COVID-19 Detection
Annotating medical images for disease detection is often tedious and
expensive. Moreover, the available training samples for a given task are
generally scarce and imbalanced. These conditions are not conducive to
learning effective deep neural models. Hence, it is common to 'transfer' neural
networks trained on natural images to the medical image domain. However, this
paradigm underperforms because of the large domain gap between the natural
and medical image data. To address that, we propose a novel concept of Pre-text
Representation Transfer (PRT). In contrast to the conventional transfer
learning, which fine-tunes a source model after replacing its classification
layers, PRT retains the original classification layers and updates the
representation layers through an unsupervised pre-text task. The task is
performed with (original, not synthetic) medical images, without utilizing any
annotations. This enables representation transfer with a large amount of
training data. This high-fidelity representation transfer allows us to use the
resulting model as a more effective feature extractor. Moreover, we can also
subsequently perform the traditional transfer learning with this model. We
devise a collaborative representation based classification layer for the case
when we leverage the model as a feature extractor. We fuse the output of this
layer with the predictions of a model induced with the traditional transfer
learning performed over our pre-text transferred model. The utility of our
technique for the limited and imbalanced data classification problem is
demonstrated with an extensive five-fold evaluation for three large-scale
models, tested for five different class-imbalance ratios for CT-based COVID-19
detection. Our results show a consistent gain over the conventional transfer
learning with the proposed method.
Comment: Best paper at IVCN
SleepEGAN: A GAN-enhanced Ensemble Deep Learning Model for Imbalanced Classification of Sleep Stages
Deep neural networks have played an important role in automatic sleep stage
classification because of their strong representation and in-model feature
transformation abilities. However, class imbalance and individual heterogeneity,
which typically exist in raw EEG signals of sleep data, can significantly affect
the classification performance of any machine learning algorithm. To solve
these two problems, this paper develops a generative adversarial network
(GAN)-powered ensemble deep learning model, named SleepEGAN, for the imbalanced
classification of sleep stages. To alleviate class imbalance, we propose a new
GAN (called EGAN) architecture adapted to the features of EEG signals for data
augmentation. The generated samples for the minority classes are used in the
training process. In addition, we design a cost-free ensemble learning strategy
to reduce the model estimation variance caused by the heterogeneity between the
validation and test sets, so as to enhance the accuracy and robustness of
prediction performance. We show that the proposed method can improve
classification accuracy compared to several existing state-of-the-art methods
using three public sleep datasets.
Comment: 20 pages, 6 figure
Uncertainty-guided Boundary Learning for Imbalanced Social Event Detection
Real-world social events typically exhibit a severe class-imbalance
distribution, which makes the trained detection model encounter a serious
generalization challenge. Most studies solve this problem from the frequency
perspective and emphasize the representation or classifier learning for tail
classes. In our observation, however, the calibrated uncertainty estimated by
well-trained evidential deep learning networks reflects model performance
better than class rarity does. To this end, we propose a novel
uncertainty-guided class imbalance learning framework, UCL, and its
variant, UCL-EC, for imbalanced social event detection tasks. We aim
to improve the overall model performance by enhancing model generalization to
those uncertain classes. Because performance degradation usually comes from
misclassifying samples into confusable neighboring classes, we focus on
boundary learning in latent space and classifier learning with high-quality
uncertainty estimation. First, we design a novel uncertainty-guided contrastive
learning loss, UCL, and its variant, UCL-EC, to shape a
distinguishable representation distribution for imbalanced data. During
training, these losses push all classes, especially uncertain ones, to adaptively
form a clearly separable boundary in the feature space. Second, to obtain more
robust and accurate class uncertainty, we combine the results of multi-view
evidential classifiers via the Dempster-Shafer theory under the supervision of
an additional calibration method. We conduct experiments on three severely
imbalanced social event datasets including Events2012_100, Events2018_100,
and CrisisLexT_7. Our model significantly improves social event representation
and classification tasks in almost all classes, especially those uncertain
ones.
Comment: Accepted by TKDE 202
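The Dempster-Shafer combination mentioned in this abstract can be sketched with the classical form of Dempster's rule. The abstract does not describe how the multi-view evidential classifiers produce their belief masses or which variant of the rule is used, so the code below is only a generic illustration; the function name `dempster_combine`, the class names, and the example masses are all hypothetical.

```python
def dempster_combine(m1, m2):
    """Combine two mass functions with Dempster's rule.

    Each mass function maps a frozenset of class names (a focal element)
    to its mass; the masses in each function are assumed to sum to 1.
    Mass on the full set of classes represents uncertainty.
    """
    combined = {}
    conflict = 0.0
    for a, mass_a in m1.items():
        for b, mass_b in m2.items():
            intersection = a & b
            if intersection:
                combined[intersection] = combined.get(intersection, 0.0) + mass_a * mass_b
            else:
                # Mass assigned to contradictory hypotheses.
                conflict += mass_a * mass_b
    if 1.0 - conflict <= 1e-12:
        raise ValueError("sources are in total conflict; rule is undefined")
    # Renormalise the non-conflicting mass so it sums to 1 again.
    return {k: v / (1.0 - conflict) for k, v in combined.items()}


# Two hypothetical views over the classes {"quake", "flood"}.
frame = frozenset({"quake", "flood"})
view_1 = {frozenset({"quake"}): 0.6, frame: 0.4}
view_2 = {frozenset({"flood"}): 0.5, frame: 0.5}
fused = dempster_combine(view_1, view_2)
```

In this example the two views partly disagree, so 0.3 of the joint mass is conflict and the remaining 0.7 is renormalised across "quake", "flood", and the undecided full frame.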
Deep Over-sampling Framework for Classifying Imbalanced Data
Class imbalance is a challenging issue in practical classification problems
for deep learning models as well as traditional models. Traditionally
successful countermeasures such as synthetic over-sampling have had limited
success with complex, structured data handled by deep learning models. In this
paper, we propose Deep Over-sampling (DOS), a framework for extending the
synthetic over-sampling method to exploit the deep feature space acquired by a
convolutional neural network (CNN). Its key feature is an explicit, supervised
representation learning, for which the training data presents each raw input
sample with a synthetic embedding target in the deep feature space, which is
sampled from the linear subspace of in-class neighbors. We implement an
iterative process of training the CNN and updating the targets, which induces
smaller in-class variance among the embeddings, to increase the discriminative
power of the deep representation. We present an empirical study using public
benchmarks, which shows that the DOS framework not only counteracts class
imbalance better than the existing method, but also improves the performance of
the CNN in the standard, balanced settings.
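The idea of drawing a synthetic embedding target from in-class neighbors can be sketched as follows. This is not the DOS implementation: it assumes Euclidean nearest neighbors that exclude the sample itself, and a random convex combination of those neighbors as the target. The function name `synthetic_target` and the parameter `k` are hypothetical.

```python
import random


def synthetic_target(embeddings, labels, index, k=3, seed=0):
    """Return a synthetic embedding target for embeddings[index], built as a
    random convex combination of its k nearest same-class neighbors.

    Assumptions: Euclidean distance, neighbors exclude the sample itself,
    weights are non-negative and sum to 1.
    """
    rng = random.Random(seed)
    anchor = embeddings[index]

    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    # Candidate neighbors: other samples with the same label.
    same_class = [
        e for i, (e, y) in enumerate(zip(embeddings, labels))
        if y == labels[index] and i != index
    ]
    if not same_class:
        # No in-class neighbor available: fall back to the sample itself.
        return list(anchor)

    neighbors = sorted(same_class, key=lambda e: sq_dist(anchor, e))[:k]

    # Random positive weights, normalised to sum to 1.
    raw = [rng.random() + 1e-9 for _ in neighbors]
    total = sum(raw)
    weights = [w / total for w in raw]

    dim = len(anchor)
    return [sum(w * n[d] for w, n in zip(weights, neighbors)) for d in range(dim)]


# Toy 2-D embeddings: three minority samples (label 1), one majority sample (label 0).
toy_embeddings = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
toy_labels = [1, 1, 1, 0]
target = synthetic_target(toy_embeddings, toy_labels, index=0, k=2)
```

In the framework described above, such targets would be recomputed as the CNN's embeddings change during training; this sketch covers a single sampling step only.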