A critical assessment of imbalanced class distribution problem: the case of predicting freshmen student attrition
Predicting student attrition is an intriguing yet challenging problem for any academic institution. Class-imbalanced data is common in the field of student retention, mainly because many students register but comparatively few drop out. Classification techniques applied to imbalanced datasets can yield deceptively high prediction accuracy, where the overall accuracy is driven by the majority class at the expense of very poor performance on the crucial minority class. In this study, we compared different data-balancing techniques to improve predictive accuracy for the minority class while maintaining satisfactory overall classification performance. Specifically, we tested three balancing techniques (oversampling, under-sampling, and the synthetic minority over-sampling technique, SMOTE) along with four popular classification methods (logistic regression, decision trees, neural networks, and support vector machines). We used a large, feature-rich institutional student dataset (covering the years 2005 to 2011) to assess the efficacy of both the balancing techniques and the prediction methods. The results indicated that the support vector machine combined with the SMOTE data-balancing technique achieved the best classification performance, with 90.24% overall accuracy on the 10-fold holdout sample. All three data-balancing techniques improved the prediction accuracy for the minority class. Applying sensitivity analyses to the developed models, we also identified the most important variables for accurate prediction of student attrition. Application of these models has the potential to accurately predict at-risk students and help reduce student dropout rates.
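The synthetic over-sampling idea mentioned in this abstract can be sketched in a few lines. This is a minimal illustration of SMOTE-style interpolation, not the authors' implementation: the function name `smote_like_oversample`, the neighbour count `k`, and the toy rows are all assumptions made for the example.

```python
import random

def smote_like_oversample(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (SMOTE-style)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by squared Euclidean distance.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(a + gap * (b - a) for a, b in zip(base, neighbour))
        )
    return synthetic

# Toy data (hypothetical): 4 minority rows with 2 features each.
minority_rows = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2), (1.2, 2.5)]
new_rows = smote_like_oversample(minority_rows, n_new=6)
```

Each synthetic row lies on a line segment between two real minority rows, so the new points stay inside the region the minority class already occupies rather than duplicating existing rows exactly.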
Generative Adversarial Networks for Bitcoin Data Augmentation
In Bitcoin entity classification, results are strongly conditioned by the
ground-truth dataset, especially when applying supervised machine learning
approaches. However, these ground-truth datasets are frequently affected by
significant class imbalance as generally they contain much more information
regarding legal services (Exchange, Gambling), than regarding services that may
be related to illicit activities (Mixer, Service). Class imbalance increases
the complexity of applying machine learning techniques and reduces the quality
of classification results, especially for underrepresented, but critical
classes.
In this paper, we propose to address this problem by using Generative
Adversarial Networks (GANs) for Bitcoin data augmentation as GANs recently have
shown promising results in the domain of image classification. However, there
is no "one-fits-all" GAN solution that works for every scenario. In fact,
setting GAN training parameters is non-trivial and heavily affects the quality
of the generated synthetic data. We therefore evaluate how GAN parameters such
as the optimization function, the size of the dataset and the chosen batch size
affect GAN implementation for one underrepresented entity class (Mining Pool)
and demonstrate how a "good" GAN configuration can be obtained that achieves
high similarity between synthetically generated and real Bitcoin address data.
To the best of our knowledge, this is the first study presenting GANs as a
valid tool for generating synthetic address data for data augmentation in
Bitcoin entity classification.
Comment: 8 pages, 5 figures, 4 tables
A cognitive based Intrusion detection system
Intrusion detection is one of the primary mechanisms to provide computer
networks with security. With attacks increasing and fields such as medicine,
commerce, and engineering growing more dependent on services delivered over a
network, securing networks has become a significant issue. The purpose
of Intrusion Detection Systems (IDS) is to build models that can distinguish
regular communications from abnormal ones and take the necessary actions. Among
different methods in this field, Artificial Neural Networks (ANNs) have been
widely used. However, ANN-based IDSs have two main disadvantages: (1) low
detection precision and (2) weak detection stability. To overcome these issues,
this paper proposes a new approach based on a Deep Neural Network (DNN). The
general mechanism of our model is as follows. First, some of the data in the
dataset is properly ranked; the dataset is then normalized with a Min-Max
normalizer to fit a limited range. Next, dimensionality reduction is
applied to decrease both the number of useless dimensions and the computational
cost. After preprocessing, the Mean-Shift clustering algorithm is used
to create different subsets and reduce the complexity of the dataset. Based on each
subset, two models are trained, one with a Support Vector Machine (SVM) and one
with a deep learning method. Of the two models for each subset, the one with the
higher accuracy is chosen. This idea is inspired by the philosophy of divide and
conquer. Hence, the DNN can learn each subset quickly and robustly. Finally, to
reduce the error from the previous step, an ANN model is trained to collect and
use these results to predict attacks. The model reaches
95.4 percent accuracy. With a simple structure and fewer
tunable parameters, the proposed model still generalizes well, with a
high level of accuracy compared to other methods such as SVM, Bayes network,
and STL.
Comment: 18 pages, 6 figures
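Two steps of the pipeline described in this abstract, Min-Max normalization and keeping the more accurate of two models per subset, can be sketched as follows. This is only an illustration under stated assumptions: the helper names, the toy rows, and the accuracy figures are hypothetical placeholders and do not come from the paper.

```python
def min_max_normalize(rows):
    """Rescale every column to [0, 1]; constant columns map to 0.0."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        tuple(
            (v - lo) / (hi - lo) if hi > lo else 0.0
            for v, lo, hi in zip(row, lows, highs)
        )
        for row in rows
    ]

def pick_model_per_subset(accuracies):
    """For each subset, keep the name of the model with the higher accuracy."""
    return {
        subset: max(scores, key=scores.get)
        for subset, scores in accuracies.items()
    }

# Toy feature rows (hypothetical).
rows = [(2.0, 100.0), (4.0, 300.0), (6.0, 200.0)]
normalized = min_max_normalize(rows)

# Placeholder validation accuracies, made up for illustration only.
accuracies = {
    "cluster_0": {"svm": 0.91, "dnn": 0.94},
    "cluster_1": {"svm": 0.89, "dnn": 0.85},
}
chosen = pick_model_per_subset(accuracies)
```

In a fuller version, the subsets would come from a clustering step such as Mean-Shift and the accuracies from held-out evaluation of each trained model; both are omitted here to keep the sketch self-contained.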
Coupling different methods for overcoming the class imbalance problem
Many classification problems must deal with imbalanced datasets where one class (the majority class) outnumbers the other classes. Standard classification methods do not provide accurate predictions in this setting since classification is generally biased towards the majority class. The minority classes are oftentimes the ones of interest (e.g., when they are associated with pathological conditions in patients), so methods for handling imbalanced datasets are critical.
Using several different datasets, this paper evaluates the performance of state-of-the-art classification methods for handling the imbalance problem in both binary and multi-class datasets. Different strategies are considered, including the one-class and dimension reduction approaches, as well as their fusions. Moreover, some ensembles of classifiers are tested, in addition to stand-alone classifiers, to assess the effectiveness of ensembles in the presence of imbalance. Finally, a novel ensemble of ensembles is designed specifically to tackle the problem of class imbalance: the proposed ensemble does not need to be tuned separately for each dataset and outperforms all the other tested approaches.
To validate our classifiers we resort to the KEEL-dataset repository, whose data partitions (training/test) are publicly available and have already been used in the open literature: as a consequence, it is possible to report a fair comparison among different approaches in the literature.
Our best approach (MATLAB code and datasets not easily accessible elsewhere) will be available at https://www.dei.unipd.it/node/2357
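The bias towards the majority class that motivates this work can be illustrated with a small worked example. The sketch below uses made-up labels (95 majority, 5 minority) and a trivial classifier that always predicts the majority class; the function names are assumptions for the example, not part of the authors' MATLAB code.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(y_true, y_pred, positive):
    """Fraction of samples of class `positive` that were predicted correctly."""
    hits = sum(
        1 for t, p in zip(y_true, y_pred) if t == positive and p == positive
    )
    total = sum(1 for t in y_true if t == positive)
    return hits / total if total else 0.0

# Toy labels (hypothetical): 95 majority samples (0) and 5 minority samples (1).
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100  # a classifier that always predicts the majority class

acc = accuracy(y_true, y_pred)
minority_recall = recall(y_true, y_pred, positive=1)
# Balanced accuracy: the mean of the per-class recalls.
balanced_acc = (recall(y_true, y_pred, positive=0) + minority_recall) / 2
```

Here `acc` is 0.95 even though no minority sample is ever detected (`minority_recall` is 0.0), which is why per-class measures such as recall or balanced accuracy (0.5 in this case) are more informative on imbalanced data.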