
    Decision tree rule-based feature selection for imbalanced data

    A class imbalance problem appears in many real-world applications, e.g., fault diagnosis, text categorization and fraud detection. When dealing with an imbalanced dataset, feature selection becomes an important issue. To address it, this work proposes a feature selection method based on decision tree rules and a weighted Gini index. The effectiveness of the proposed method is verified by classifying a dataset from Santander Bank and two datasets from the UCI Machine Learning Repository. The results show that our method achieves higher Area Under the Curve (AUC) and F-measure. We also compare it with filter-based feature selection approaches, i.e., Chi-square and F-statistic; it outperforms them but requires slightly more computational effort.
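    The abstract does not include code, but a minimal sketch of the filter-based baselines it compares against (Chi-square and F-statistic), alongside a decision-tree impurity ranking used here only as a stand-in for the paper's weighted Gini index, might look like the following; the dataset and feature counts are placeholders.

```python
# Hypothetical sketch: rank features by Chi-square, F-statistic, and a
# decision-tree Gini importance (a stand-in for the paper's weighted Gini index).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import chi2, f_classif
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, weights=[0.95, 0.05],
                           random_state=0)  # imbalanced toy data

X_pos = MinMaxScaler().fit_transform(X)     # chi2 requires non-negative features
chi2_scores, _ = chi2(X_pos, y)
f_scores, _ = f_classif(X, y)

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
gini_importance = tree.feature_importances_  # impurity-based (Gini) importance

top_k = 5
for name, scores in [("chi2", chi2_scores), ("F-statistic", f_scores),
                     ("tree Gini importance", gini_importance)]:
    print(name, np.argsort(scores)[::-1][:top_k])
```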

    Massive Open Online Courses Temporal Profiling for Dropout Prediction

    Massive Open Online Courses (MOOCs) are attracting the attention of people all over the world. Regardless of the platform, the numbers of registrants for online courses are impressive, but at the same time, completion rates are disappointing. Understanding the mechanisms of dropping out based on the learner profile is a crucial task in MOOCs, since it allows intervening at the right moment to assist the learner in completing the course. In this paper, the dropout behavior of learners in a MOOC is studied in depth by first extracting features that describe the behavior of learners within the course and then by comparing three classifiers (Logistic Regression, Random Forest and AdaBoost) on two tasks: predicting which users will have dropped out by a certain week and predicting which users will drop out on a specific week. The former proved considerably easier, with all three classifiers performing equally well. However, the accuracy for the second task is lower, and Logistic Regression tends to perform slightly better than the other two algorithms. We found that features that reflect an active attitude of the user towards the MOOC, such as submitting their assignments, posting on the Forum and filling in their Profile, are strong indicators of persistence.
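    A minimal sketch of the classifier comparison described above, assuming a per-learner feature matrix (e.g., assignment submissions, forum posts, profile completion) and binary dropout labels; the actual feature extraction is specific to the paper, so synthetic data stands in for it here.

```python
# Hypothetical sketch: compare the three classifiers from the abstract on a
# dropout-by-week-k prediction task, using accuracy for comparison.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=12, weights=[0.3, 0.7],
                           random_state=0)  # placeholder for per-learner features

models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "RandomForest": RandomForestClassifier(n_estimators=200, random_state=0),
    "AdaBoost": AdaBoostClassifier(n_estimators=200, random_state=0),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(f"{name}: {scores.mean():.3f}")
```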

    Experimental evaluation of ensemble classifiers for imbalance in Big Data

    Datasets are growing in size and complexity at a pace never seen before, forming ever larger collections known as Big Data. A common problem for classification, especially in Big Data, is that the examples of the different classes might not be balanced. Imbalanced classification was introduced some decades ago to correct the tendency of classifiers to show bias in favor of the majority class and to ignore the minority one. To date, although the number of imbalanced classification methods has increased, they continue to focus on normal-sized datasets rather than on the new reality of Big Data. In this paper, in-depth experimentation with ensemble classifiers is conducted in the context of imbalanced Big Data classification, using two popular ensemble families (Bagging and Boosting) and different resampling methods. All the experiments were launched on Spark clusters, comparing ensemble performance and execution times with statistical test results, including the newest tests based on the Bayesian approach. One very interesting conclusion from the study is that simpler methods applied to unbalanced datasets in the context of Big Data provided better results than complex methods. The additional complexity of some of the sophisticated methods, which appears necessary to process and reduce imbalance in normal-sized datasets, was not effective for imbalanced Big Data. This work was supported by the “la Caixa” Foundation, Spain, under agreement LCF/PR/PR18/51130007; by the Junta de Castilla y León, Spain, under project BU055P20 (JCyL/FEDER, UE), co-financed through European Union FEDER funds; and by the Consejería de Educación of the Junta de Castilla y León and the European Social Fund, Spain, through a pre-doctoral grant (EDU/1100/2017).
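    The experiments above run on Spark clusters; as a small-scale, single-machine analogue (not the authors' pipeline), a Bagging ensemble combined with random undersampling can be sketched with scikit-learn and imbalanced-learn. The exact resampling methods and ensemble variants evaluated in the paper may differ.

```python
# Hypothetical single-machine sketch: Bagging where each base learner is trained
# on a bootstrap sample undersampled to a 1:1 class ratio.
from imblearn.ensemble import BalancedBaggingClassifier
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=50_000, n_features=20, weights=[0.99, 0.01],
                           random_state=0)  # placeholder imbalanced data
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Default base learner is a decision tree; each tree sees a balanced resample.
clf = BalancedBaggingClassifier(n_estimators=50, random_state=0)
clf.fit(X_tr, y_tr)
print("AUC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
```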

    A Modified Boosted Ensemble Classifier on Location Based Social Networking

    Unbalanced data classification is one of the research issues that researchers are interested in. Boosting approaches such as Wang's Boosting and the Modified Boosted SVM (MBSVM) have been demonstrated to be more effective for unbalanced data. Our proposal, the Modified Boosted Random Forest (MBRF), is a Random Forest classifier that uses a Boosting approach. The main motivation of the study is to analyze the sentiment of geotagged tweets in the FIFA and Olympics datasets in order to understand people's state of mind. The tree-based Random Forest algorithm with a boosting approach classifies the tweets to build a recommendation system, with the idea of providing commercial suggestions to participants and recommending local places to visit or activities to perform. MBRF employs the following strategies: i) a distance-based weight-update method based on K-Medoids, and ii) a sign-based classifier elimination technique. We partitioned the datasets with 70% of the data allocated for training and the remaining 30% used as test data. The imbalanced data ratios measured 3.1666 and 4.6 for the FIFA and Olympics datasets, respectively. We looked at accuracy, precision, recall and ROC curves for each event. The average AUC achieved by MBRF is 0.96 on the FIFA dataset and 0.97 on the Olympics dataset. A comparison of MBRF with a Decision Tree model using 'Entropy' showed that MBRF performs better.
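    A minimal sketch of the evaluation setup described above (70/30 split, imbalance ratio, AUC), with a plain Random Forest standing in for MBRF; the MBRF-specific K-Medoids weight update and sign-based classifier elimination are not reproduced here, and the data is synthetic.

```python
# Hypothetical sketch of the evaluation setup only; not the MBRF algorithm itself.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=30, weights=[0.76, 0.24],
                           random_state=0)  # roughly a 3.17:1 imbalance ratio

counts = np.bincount(y)
print(f"imbalance ratio: {counts[0] / counts[1]:.3f}")

X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.7, stratify=y,
                                          random_state=0)  # 70/30 split
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print("AUC:", roc_auc_score(y_te, rf.predict_proba(X_te)[:, 1]))
```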

    Developing A Machine Learning Based Approach For Fractured Zone Detection By Using Petrophysical Logs

    Oil reservoirs are divided into three categories: carbonate (fractured), sandstone and unconventional reservoirs. Identification and modeling of fractures in fractured reservoirs are very important for geomechanical issues, fluid flow simulation and enhanced oil recovery. Image and petrophysical logs are separate tools, run inside oil wells, that capture physical characteristics of reservoirs, e.g. geological rock types, porosity, and permeability. Fractures can be distinguished using image logs because of their higher resolution. Image logs are an expensive and newly developed tool, so they have been run in a limited number of wells, whereas petrophysical logs are usually run inside the wells. The lack of image logs creates great difficulties in fracture detection, as well as in fracture studies. In the last decade, a few studies have attempted to distinguish fractured zones in oil wells by applying data mining methods to petrophysical logs. The goal of this study was likewise to discriminate fractured/non-fractured zones by using machine learning techniques and petrophysical logs. To do that, the interpretation of image logs was used to label reservoir depths of the studied wells as 0 (non-fractured zone) or 1 (fractured zone). We developed four classifiers (Deep Learning, Support Vector Machine, Decision Tree, and Random Forest) and applied them to petrophysical logs to discriminate fractured/non-fractured zones. Ordered Weighted Averaging was the data fusion method that we used to integrate the outputs of the classifiers into a single, more reliable result. Overall, the frequency of non-fractured zones is about twice that of fractured zones, which leads to an imbalanced condition between the two classes. Therefore, the aforementioned procedure was applied to both balanced and imbalanced data to investigate the influence of creating a balanced situation between the classes. Results showed that Random Forest and Support Vector Machine are the better classifiers, with above 95 percent accuracy in discriminating fractured/non-fractured zones. Meanwhile, creating a balanced situation in the wells with a higher imbalance index helps to distinguish both non-fractured and fractured zones. With imbalanced data, non-fractured zones (the dominant class) could be perfectly distinguished, while a significant percentage of fractured zones were also labeled as non-fractured ones.
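    Ordered Weighted Averaging (OWA) fuses classifier outputs by sorting them and applying position-dependent weights. A minimal sketch of OWA fusion over the fracture probabilities of the four classifiers named above; the weight vector here is illustrative, not the one used in the study.

```python
# Hypothetical sketch: OWA fusion of per-sample fracture probabilities from
# several classifiers. Weights are illustrative.
import numpy as np

def owa(scores, weights):
    """Ordered Weighted Averaging: sort scores in descending order along the
    last axis, then take a weighted sum with position-dependent weights."""
    ordered = np.sort(scores, axis=-1)[..., ::-1]
    return ordered @ np.asarray(weights)

# Columns: fracture probabilities from e.g. deep learning, SVM, DT, RF.
probs = np.array([[0.91, 0.84, 0.70, 0.88],
                  [0.15, 0.40, 0.22, 0.05]])
weights = [0.4, 0.3, 0.2, 0.1]          # position weights, must sum to 1
fused = owa(probs, weights)
labels = (fused >= 0.5).astype(int)     # 1 = fractured zone
print(fused, labels)
```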

    Predicting class-imbalanced business risk using resampling, regularization, and model ensembling algorithms

    We aim to develop and improve imbalanced business risk modeling by jointly using proper evaluation criteria, resampling, cross-validation, classifier regularization, and ensembling techniques. The Area Under the Receiver Operating Characteristic Curve (AUC of ROC) is used for model comparison based on 10-fold cross-validation. Two undersampling strategies, random undersampling (RUS) and cluster centroid undersampling (CCUS), as well as two oversampling methods, random oversampling (ROS) and the Synthetic Minority Oversampling Technique (SMOTE), are applied. Three highly interpretable classifiers, including logistic regression without regularization (LR), L1-regularized LR (L1LR), and decision tree (DT), are implemented. Two ensembling techniques, Bagging and Boosting, are applied to the DT classifier for further model improvement. The results show that Boosting on DT using the oversampled data containing 50% positives via SMOTE is the optimal model, achieving an AUC, recall, and F1 score of 0.8633, 0.9260, and 0.8907, respectively.
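    A minimal sketch of the winning configuration described above (SMOTE oversampling to 50% positives followed by boosted decision trees), using imbalanced-learn's pipeline so that resampling happens only on the training folds during cross-validation; the data and hyperparameters are placeholders.

```python
# Hypothetical sketch: SMOTE to a 1:1 class ratio (50% positives) + AdaBoost on
# decision trees, evaluated with 10-fold cross-validated AUC.
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=5000, n_features=25, weights=[0.9, 0.1],
                           random_state=0)  # placeholder for the business-risk data

pipe = Pipeline([
    ("smote", SMOTE(sampling_strategy=1.0, random_state=0)),  # resample to 50% positives
    ("boost", AdaBoostClassifier(n_estimators=200, random_state=0)),
])
aucs = cross_val_score(pipe, X, y, cv=10, scoring="roc_auc")
print("mean AUC:", aucs.mean())
```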

    Studying and handling iterated algorithmic biases in human and machine learning interaction.

    Algorithmic bias consists of biased predictions born from ingesting unchecked information, such as biased samples and biased labels. Furthermore, the interaction between people and algorithms can exacerbate bias such that neither the human nor the algorithm receives unbiased data. Thus, algorithmic bias can be introduced not only before and after the machine learning process but sometimes also in the middle of the learning process. Only a handful of categories of bias have been studied in machine learning, and there are few, if any, studies of the impact of bias on both human behavior and algorithm performance. Although most research treats algorithmic bias as a static factor, we argue that algorithmic bias interacts with humans in an iterative manner, producing a long-term effect on algorithms' performance. Recommender systems involve the natural interaction between humans and machine learning algorithms, which may introduce bias over time during a continuous feedback loop, leading to increasingly biased recommendations. Therefore, in this work, we view a recommender system environment as generating a continuous chain of events as a result of the interactions between users and the recommender system outputs over time. In the first part of this dissertation, we employ an iterated-learning framework inspired by human language evolution to study the impact of the interaction between machine learning algorithms and humans. Specifically, our goal is to study the impact of the interaction between two sources of bias: the process by which people select information to label (human action), and the process by which an algorithm selects the subset of information to present to people (iterated algorithmic bias mode). We investigate three forms of iterated algorithmic bias (i.e. personalization filter, active learning, and a random baseline) and how they affect the behavior of machine learning algorithms. Our controlled experiments, which simulate content-based filters, demonstrate that the three iterated bias modes, initial training data class imbalance, and human action affect the models learned by machine learning algorithms. We also found that iterated filter bias, which is prominent in personalized user interfaces, can lead to increased inequality in estimated relevance and to a limited human ability to discover relevant data. In the second part of this dissertation work, we focus on collaborative filtering recommender systems, which suffer from additional biases due to the popularity of certain items; when coupled with the iterated bias emerging from the feedback loop between humans and algorithms, this leads to an increased divide between the popular items (the haves) and the unpopular items (the have-nots). We thus propose several debiasing algorithms, including a novel blind-spot-aware matrix factorization algorithm, and evaluate how our proposed algorithms impact both prediction accuracy and the trends of increase or decrease in the inequality of the popularity distribution of items over time. Our findings indicate that the relevance blind spot (items from the testing set whose predicted relevance probability is less than 0.5) amounted to 4% of all relevant items when using a content-based filter that predicts relevant items. A similar simulation using a real-life rating dataset found that the same filter resulted in a blind spot size of 75% of the relevant testing set.
In the case of collaborative filtering on synthetic rating data, and when using 20 latent factors, conventional Matrix Factorization resulted in a ranking-based blind spot (items whose predicted ratings are below 90% of the maximum predicted rating) ranging between 95% and 99% of all items on average. Both propensity-based Matrix Factorization methods resulted in blind spots consisting of between 94% and 96% of all items, while the blind-spot-aware Matrix Factorization resulted in a ranking-based blind spot of around 90% to 94% of all items. For semi-synthetic data (a real rating dataset completed with Matrix Factorization), Matrix Factorization using 20 latent factors resulted in a ranking-based blind spot containing between 95% and 99% of all items. Popularity-based and Poisson-based propensity Matrix Factorization resulted in a ranking-based blind spot with between 96% and 97% of all items, while the blind-spot-aware Matrix Factorization resulted in a ranking-based blind spot with between 92% and 96% of all items. Considering that recommender systems are typically used as gateways that filter massive amounts of information (in the millions) for relevance, these differences in blind spot percentages (every 1% amounts to tens of thousands of items or options) show that debiasing these systems can have significant repercussions on the amount of information and the space of options that can be discovered by humans who interact with algorithmic filters.
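    The two blind-spot definitions above are simple thresholds over predicted relevance probabilities or predicted ratings. A minimal sketch of how their sizes could be measured from model predictions; the matrix factorization models themselves are not reproduced, and the prediction arrays are placeholders.

```python
# Hypothetical sketch: compute the size of the two blind spots defined in the abstract.
import numpy as np

def relevance_blind_spot(pred_prob, threshold=0.5):
    """Fraction of (relevant test) items whose predicted relevance probability
    falls below the threshold."""
    return float(np.mean(pred_prob < threshold))

def ranking_blind_spot(pred_ratings, fraction=0.9):
    """Fraction of items whose predicted rating falls below
    fraction * (maximum predicted rating)."""
    cutoff = fraction * pred_ratings.max()
    return float(np.mean(pred_ratings < cutoff))

rng = np.random.default_rng(0)
pred_prob = rng.random(10_000)               # placeholder content-based relevance scores
pred_ratings = rng.normal(3.0, 1.0, 10_000)  # placeholder MF predicted ratings
print(relevance_blind_spot(pred_prob), ranking_blind_spot(pred_ratings))
```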

    Effective Feature Selection Methods for User Sentiment Analysis using Machine Learning

    Text classification is the method of allocating a particular piece of text to one or more of a number of predetermined categories or labels. This is done by training a machine learning model on a labeled dataset, where the texts and their corresponding labels are provided. The model then learns to predict the labels of new, unseen texts. Feature selection is a significant step in text classification, as it helps to identify the most relevant features or words in the text that are useful for predicting the label. These can include specific keywords or phrases, or even the frequency or placement of certain words in the text. The performance of the model can be improved by focusing on the features that carry the information most likely to be useful for classification. Additionally, feature selection helps to reduce the dimensionality of the dataset, making the model more efficient and easier to interpret. This paper presents a method for extracting aspect terms from product reviews that makes use of the Gini index, information gain, and feature selection in conjunction with machine learning classifiers. In the proposed method, referred to as wRMR, the Gini index and information gain are used for feature selection; machine learning classifiers are then used to extract aspect terms from product reviews. A set of customer testimonials is used to assess how well the proposed method works, and the findings indicate that, in terms of the extraction of aspect terms, the proposed method is superior to the traditionally used method. In addition, the proposed approach is compared with current state-of-the-art methods and achieves superior performance. Overall, the presented method provides a promising solution for the extraction of aspect terms and can also be used for other natural language processing tasks.
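    A minimal sketch of the two feature-scoring criteria named above (information gain and a Gini-based importance) applied to a bag-of-words representation of reviews; the wRMR combination and the aspect-term extraction step are specific to the paper and not reproduced here, and the review snippets and labels are invented toy data.

```python
# Hypothetical sketch: score bag-of-words features by information gain (mutual
# information) and by a Gini-impurity-based importance, then keep the top-k terms.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import mutual_info_classif

reviews = ["battery life is great", "screen is too dim", "love the camera quality",
           "battery drains fast", "camera is blurry", "great screen brightness"]
labels = [1, 0, 1, 0, 0, 1]  # toy sentiment labels

vec = CountVectorizer()
X = vec.fit_transform(reviews)
terms = np.array(vec.get_feature_names_out())

info_gain = mutual_info_classif(X, labels, discrete_features=True, random_state=0)
gini_importance = RandomForestClassifier(random_state=0).fit(X, labels).feature_importances_

k = 5
print("top by information gain:", terms[np.argsort(info_gain)[::-1][:k]])
print("top by Gini importance:", terms[np.argsort(gini_importance)[::-1][:k]])
```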