4,555 research outputs found

    A critical assessment of imbalanced class distribution problem: the case of predicting freshmen student attrition

    Get PDF
    Predicting student attrition is an intriguing yet challenging problem for any academic institution. Class-imbalanced data is common in the field of student retention, mainly because many students enroll while comparatively few drop out. Classification techniques applied to imbalanced datasets can yield deceivingly high prediction accuracy, where overall accuracy is driven by the majority class at the expense of very poor performance on the crucial minority class. In this study, we compared different data-balancing techniques to improve predictive accuracy on the minority class while maintaining satisfactory overall classification performance. Specifically, we tested three balancing techniques (oversampling, under-sampling, and synthetic minority over-sampling, SMOTE) along with four popular classification methods (logistic regression, decision trees, neural networks, and support vector machines). We used a large and feature-rich institutional student dataset (covering the years 2005 to 2011) to assess the efficacy of both the balancing techniques and the prediction methods. The results indicated that the support vector machine combined with the SMOTE data-balancing technique achieved the best classification performance, with 90.24% overall accuracy on the 10-fold holdout sample. All three data-balancing techniques improved prediction accuracy for the minority class. Applying sensitivity analyses to the developed models, we also identified the most important variables for accurate prediction of student attrition. Application of these models has the potential to accurately identify at-risk students and help reduce student dropout rates.
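    The reported combination (SMOTE for balancing plus a support vector machine, evaluated with 10-fold cross-validation) can be approximated with off-the-shelf libraries. Below is a minimal sketch, assuming scikit-learn and imbalanced-learn and substituting a synthetic imbalanced dataset for the institutional student data; it is an illustration of the technique, not the study's implementation.

    ```python
    # Illustrative sketch only: SMOTE oversampling + SVM, evaluated with 10-fold CV.
    # A synthetic imbalanced dataset stands in for the institutional student records.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import StratifiedKFold, cross_val_score
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC
    from imblearn.over_sampling import SMOTE
    from imblearn.pipeline import Pipeline  # applies the sampler only when fitting

    X, y = make_classification(n_samples=5000, n_features=20, weights=[0.85, 0.15],
                               random_state=0)  # ~15% minority class (e.g. dropouts)

    model = Pipeline([
        ("scale", StandardScaler()),
        ("smote", SMOTE(random_state=0)),   # balance the training folds only
        ("svm", SVC(kernel="rbf", C=1.0)),
    ])

    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    scores = cross_val_score(model, X, y, cv=cv, scoring="balanced_accuracy")
    print(f"10-fold balanced accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
    ```

    Balanced accuracy is used here because, as the abstract notes, overall accuracy alone can hide poor minority-class performance.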

    Oversampling for Imbalanced Learning Based on K-Means and SMOTE

    Full text link
    Learning from class-imbalanced data continues to be a common and challenging problem in supervised learning, as standard classification algorithms are designed to handle balanced class distributions. While different strategies exist to tackle this problem, methods which generate artificial data to achieve a balanced class distribution are more versatile than modifications to the classification algorithm. Such techniques, called oversamplers, modify the training data, allowing any classifier to be used with class-imbalanced datasets. Many algorithms have been proposed for this task, but most are complex and tend to generate unnecessary noise. This work presents a simple and effective oversampling method based on k-means clustering and SMOTE oversampling, which avoids the generation of noise and effectively overcomes imbalances between and within classes. Empirical results of extensive experiments with 71 datasets show that training data oversampled with the proposed method improves classification results. Moreover, k-means SMOTE consistently outperforms other popular oversampling methods. An implementation is made available in the Python programming language. Comment: 19 pages, 8 figures
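    The idea described here (cluster the data with k-means, then apply SMOTE only inside clusters where the minority class is sufficiently represented) is also available in imbalanced-learn as KMeansSMOTE. A rough sketch under that assumption, with the cluster count and balance threshold chosen purely for illustration:

    ```python
    # Illustrative sketch of the k-means SMOTE idea via imbalanced-learn's KMeansSMOTE.
    from collections import Counter

    from sklearn.cluster import MiniBatchKMeans
    from sklearn.datasets import make_classification
    from imblearn.over_sampling import KMeansSMOTE

    X, y = make_classification(n_samples=2000, n_features=10, weights=[0.9, 0.1],
                               n_clusters_per_class=2, random_state=0)
    print("before:", Counter(y))

    sampler = KMeansSMOTE(
        kmeans_estimator=MiniBatchKMeans(n_clusters=20, random_state=0),
        cluster_balance_threshold=0.1,  # oversample clusters with >=10% minority points
        random_state=0,
    )
    X_res, y_res = sampler.fit_resample(X, y)
    print("after: ", Counter(y_res))
    ```

    Restricting synthetic samples to minority-friendly clusters is what lets the method avoid generating noise in regions dominated by the majority class.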

    Analysis of group evolution prediction in complex networks

    Full text link
    In a world in which acceptance and identification with social communities are highly desired, the ability to predict the evolution of groups over time appears to be a vital but very complex research problem. We therefore propose a new, adaptable, generic, multi-stage method for Group Evolution Prediction (GEP) in complex networks that facilitates reasoning about the future states of recently discovered groups. The modularity of GEP enabled us to carry out extensive and versatile empirical studies on many real-world complex and social networks to analyze the impact of numerous setups and parameters, such as time window type and size, group detection method, evolution chain length, and prediction models. Additionally, many new predictive features reflecting the group state at a given time have been identified and tested. Other research problems, such as enriching learning evolution chains with external data, have been analyzed as well.
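    The abstract gives no code, so the following is only a schematic of the multi-stage idea (time windows, group detection, group-state features, supervised prediction of the next evolution event), not the authors' GEP implementation; the feature set and event labels are hypothetical placeholders.

    ```python
    # Schematic GEP-style pipeline sketch (hypothetical features and labels).
    import networkx as nx
    import numpy as np
    from networkx.algorithms.community import greedy_modularity_communities
    from sklearn.ensemble import RandomForestClassifier

    def detect_groups(edges):
        """Detect groups in one time window with a standard community method."""
        g = nx.Graph()
        g.add_edges_from(edges)
        return [set(c) for c in greedy_modularity_communities(g)], g

    def group_state_features(group, g):
        """Simple features reflecting the group state: size, density, mean degree."""
        sub = g.subgraph(group)
        degs = [d for _, d in sub.degree()]
        return [len(group), nx.density(sub), float(np.mean(degs)) if degs else 0.0]

    def build_training_set(windows, events):
        """windows: edge lists per time window; events: next-step labels per group,
        e.g. 'grow', 'shrink', 'split', 'dissolve' (placeholder labelling)."""
        X, y = [], []
        for edges, window_events in zip(windows, events):
            groups, g = detect_groups(edges)
            for group, event in zip(groups, window_events):
                X.append(group_state_features(group, g))
                y.append(event)
        return X, y

    # clf = RandomForestClassifier(random_state=0).fit(*build_training_set(windows, events))
    ```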

    Doctor of Philosophy

    Get PDF
    In the current business world, data collection for business analysis is no longer difficult. The major concern faced by business managers is whether they can use data to build predictive models that provide accurate information for decision-making. Knowledge Discovery from Databases (KDD) provides a guideline for collecting data and identifying the knowledge inside it. As one of the KDD steps, data mining provides a systematic and intelligent approach to learning from large amounts of data and is critical to the success of KDD. Over the past several decades, many data mining algorithms have been developed; they can be categorized as classification, association rule mining, and clustering, and they have proven very effective in answering different business questions. Among these, classification is the most popular group and is widely used across business areas. However, existing classification algorithms are designed to maximize prediction accuracy under the assumption of equal class distribution and equal error costs, an assumption that seldom holds in the real world. Thus, it is necessary to extend current classification methods so that they can deal with imbalanced class distributions and unequal costs. In this dissertation, I propose an Iterative Cost-Sensitive Naïve Bayes (ICSNB) method aimed at reducing overall misclassification cost regardless of class distribution. During each iteration, the K nearest neighbors of the unsolved instances are identified and form a new training set, which is used to learn those instances. Using the characteristics of the nearest-neighbor method, I also develop a new under-sampling method to address the imbalance problem in the second study: a general method that identifies noisy instances in the data set to create a balanced data set for learning. Both methods are validated on multiple real-world data sets, and the empirical results show their superior performance compared to existing, popular methods.
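    The abstract only outlines the two methods, so the sketch below illustrates the general flavour of nearest-neighbour-based under-sampling (dropping majority-class instances whose neighbours are mostly from the minority class, i.e. likely noise). It is not the dissertation's ICSNB or its exact under-sampling algorithm; the threshold and neighbourhood size are assumptions.

    ```python
    # Generic k-NN under-sampling sketch; NOT the dissertation's ICSNB or exact method.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.neighbors import NearestNeighbors

    def knn_undersample(X, y, majority_label, k=5):
        """Remove majority-class points whose k nearest neighbours are mostly minority."""
        nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
        _, idx = nn.kneighbors(X)            # idx[:, 0] is the point itself
        keep = np.ones(len(y), dtype=bool)
        for i in np.where(y == majority_label)[0]:
            neighbour_labels = y[idx[i, 1:]]
            if np.mean(neighbour_labels != majority_label) > 0.5:
                keep[i] = False              # looks like noise near the minority class
        return X[keep], y[keep]

    X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], flip_y=0.05,
                               random_state=0)
    X_res, y_res = knn_undersample(X, y, majority_label=0)
    print(len(y), "->", len(y_res), "instances after under-sampling")
    ```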