10,981 research outputs found

    A critical assessment of imbalanced class distribution problem: the case of predicting freshmen student attrition

    Get PDF
    Predicting student attrition is an intriguing yet challenging problem for any academic institution. Class-imbalanced data is a common in the field of student retention, mainly because a lot of students register but fewer students drop out. Classification techniques for imbalanced dataset can yield deceivingly high prediction accuracy where the overall predictive accuracy is usually driven by the majority class at the expense of having very poor performance on the crucial minority class. In this study, we compared different data balancing techniques to improve the predictive accuracy in minority class while maintaining satisfactory overall classification performance. Specifically, we tested three balancing techniquesā€”oversampling, under-sampling and synthetic minority over-sampling (SMOTE)ā€”along with four popular classification methodsā€”logistic regression, decision trees, neuron networks and support vector machines. We used a large and feature rich institutional student data (between the years 2005 and 2011) to assess the efficacy of both balancing techniques as well as prediction methods. The results indicated that the support vector machine combined with SMOTE data-balancing technique achieved the best classification performance with a 90.24% overall accuracy on the 10-fold holdout sample. All three data-balancing techniques improved the prediction accuracy for the minority class. Applying sensitivity analyses on developed models, we also identified the most important variables for accurate prediction of student attrition. Application of these models has the potential to accurately predict at-risk students and help reduce student dropout rates

    Generative Adversarial Networks for Bitcoin Data Augmentation

    Get PDF
    In Bitcoin entity classification, results are strongly conditioned by the ground-truth dataset, especially when applying supervised machine learning approaches. However, these ground-truth datasets are frequently affected by significant class imbalance as generally they contain much more information regarding legal services (Exchange, Gambling), than regarding services that may be related to illicit activities (Mixer, Service). Class imbalance increases the complexity of applying machine learning techniques and reduces the quality of classification results, especially for underrepresented, but critical classes. In this paper, we propose to address this problem by using Generative Adversarial Networks (GANs) for Bitcoin data augmentation as GANs recently have shown promising results in the domain of image classification. However, there is no "one-fits-all" GAN solution that works for every scenario. In fact, setting GAN training parameters is non-trivial and heavily affects the quality of the generated synthetic data. We therefore evaluate how GAN parameters such as the optimization function, the size of the dataset and the chosen batch size affect GAN implementation for one underrepresented entity class (Mining Pool) and demonstrate how a "good" GAN configuration can be obtained that achieves high similarity between synthetically generated and real Bitcoin address data. To the best of our knowledge, this is the first study presenting GANs as a valid tool for generating synthetic address data for data augmentation in Bitcoin entity classification.Comment: 8 pages, 5 figures, 4 table

    A cognitive based Intrusion detection system

    Full text link
    Intrusion detection is one of the primary mechanisms to provide computer networks with security. With an increase in attacks and growing dependence on various fields such as medicine, commercial, and engineering to give services over a network, securing networks have become a significant issue. The purpose of Intrusion Detection Systems (IDS) is to make models which can recognize regular communications from abnormal ones and take necessary actions. Among different methods in this field, Artificial Neural Networks (ANNs) have been widely used. However, ANN-based IDS, has two main disadvantages: 1- Low detection precision. 2- Weak detection stability. To overcome these issues, this paper proposes a new approach based on Deep Neural Network (DNN. The general mechanism of our model is as follows: first, some of the data in dataset is properly ranked, afterwards, dataset is normalized with Min-Max normalizer to fit in the limited domain. Then dimensionality reduction is applied to decrease the amount of both useless dimensions and computational cost. After the preprocessing part, Mean-Shift clustering algorithm is the used to create different subsets and reduce the complexity of dataset. Based on each subset, two models are trained by Support Vector Machine (SVM) and deep learning method. Between two models for each subset, the model with a higher accuracy is chosen. This idea is inspired from philosophy of divide and conquer. Hence, the DNN can learn each subset quickly and robustly. Finally, to reduce the error from the previous step, an ANN model is trained to gain and use the results in order to be able to predict the attacks. We can reach to 95.4 percent of accuracy. Possessing a simple structure and less number of tunable parameters, the proposed model still has a grand generalization with a high level of accuracy in compared to other methods such as SVM, Bayes network, and STL.Comment: 18 pages, 6 figure

    Coupling different methods for overcoming the class imbalance problem

    Get PDF
    Many classification problems must deal with imbalanced datasets where one class \u2013 the majority class \u2013 outnumbers the other classes. Standard classification methods do not provide accurate predictions in this setting since classification is generally biased towards the majority class. The minority classes are oftentimes the ones of interest (e.g., when they are associated with pathological conditions in patients), so methods for handling imbalanced datasets are critical. Using several different datasets, this paper evaluates the performance of state-of-the-art classification methods for handling the imbalance problem in both binary and multi-class datasets. Different strategies are considered, including the one-class and dimension reduction approaches, as well as their fusions. Moreover, some ensembles of classifiers are tested, in addition to stand-alone classifiers, to assess the effectiveness of ensembles in the presence of imbalance. Finally, a novel ensemble of ensembles is designed specifically to tackle the problem of class imbalance: the proposed ensemble does not need to be tuned separately for each dataset and outperforms all the other tested approaches. To validate our classifiers we resort to the KEEL-dataset repository, whose data partitions (training/test) are publicly available and have already been used in the open literature: as a consequence, it is possible to report a fair comparison among different approaches in the literature. Our best approach (MATLAB code and datasets not easily accessible elsewhere) will be available at https://www.dei.unipd.it/node/2357
    • ā€¦
    corecore