
    The Impact of Overfitting and Overgeneralization on the Classification Accuracy in Data Mining

    Current classification approaches usually do not try to achieve a balance between fitting and generalization when they infer models from training data. Such approaches ignore the possibility of different penalty costs for the false-positive, false-negative, and unclassifiable error types. Thus, their performance may not be optimal or may even be coincidental. This dissertation analyzes the above issues in depth. It also proposes two new approaches called the Homogeneity-Based Algorithm (HBA) and the Convexity-Based Algorithm (CBA) to address these issues. These new approaches aim at optimally balancing the data fitting and generalization behaviors of models when some traditional classification approaches are used. The approaches first define the total misclassification cost (TC) as a weighted function of the three penalty costs and their corresponding error rates. The approaches then partition the training data into regions. In the HBA, the partitioning is done according to some homogeneous properties derivable from the training data. Meanwhile, the CBA employs some convex properties to derive regions. A traditional classification method is then used in conjunction with the HBA and CBA. Finally, the approaches apply a genetic approach to determine the optimal levels of fitting and generalization. The TC serves as the fitness function in this genetic approach. Real-life datasets from a wide spectrum of domains were used to better understand the effectiveness of the HBA and CBA. The computational results have indicated that both the HBA and CBA may fill a critical gap in the implementation of current or future classification approaches. Furthermore, the results have also shown that when the penalty cost of an error type was changed, the corresponding error rate followed stepwise patterns. The finding of stepwise patterns of classification errors can assist researchers in determining applicable penalties for classification errors. Thus, the dissertation also proposes a binary search approach (BSA) to produce those patterns. Real-life datasets were utilized to demonstrate the BSA.
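    The abstract states only that TC is a weighted function of the three penalty costs and their corresponding error rates; the Python sketch below assumes the simplest such form, a linear weighted sum, as an illustrative reading rather than the dissertation's exact definition.

        # Hedged sketch: TC as a linear weighted sum of the three error
        # rates. The linear form is an assumption; the abstract says only
        # that TC is a weighted function of penalty costs and error rates.
        def total_misclassification_cost(rates, costs):
            """rates/costs map error type -> error rate / penalty cost."""
            return sum(costs[k] * rates[k] for k in ("fp", "fn", "uc"))

        # Example: false negatives are penalized five times as heavily as
        # false positives, unclassifiable cases half as heavily.
        rates = {"fp": 0.08, "fn": 0.03, "uc": 0.05}
        costs = {"fp": 1.0, "fn": 5.0, "uc": 0.5}
        print(total_misclassification_cost(rates, costs))  # 0.255

    A genetic search over fitting/generalization levels, as the dissertation describes, would then use this value as the fitness to minimize.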

    Application of Synthetic Informative Minority Over-Sampling (SIMO) Algorithm Leveraging Support Vector Machine (SVM) On Small Datasets with Class Imbalance

    Developing predictive models for classification problems on imbalanced datasets is one of the basic difficulties in data mining and decision analytics. A classifier's performance declines dramatically when it is applied to an imbalanced dataset. Standard classifiers such as logistic regression and the Support Vector Machine (SVM) are appropriate for balanced training sets, whereas they provide suboptimal classification results when used on unbalanced datasets. Measuring performance by prediction accuracy encourages a bias towards the majority class: the rare instances remain undetected even though the model achieves high overall accuracy, and minority instances may be treated as noise and vice versa (Haixiang et al., 2017). A wide range of class-imbalance learning techniques has been introduced to overcome these problems, although each has its advantages and shortcomings. This paper details the behavior of a novel imbalanced learning technique, the Synthetic Informative Minority Over-Sampling (SIMO) algorithm leveraging the Support Vector Machine (SVM), on small datasets with fewer than 200 records. Two base classifiers, logistic regression and SVM, are used to validate the impact of SIMO on classifier performance in terms of the G-mean and Area Under the Curve metrics. SIMO is also compared with other algorithms, SMOTE, Borderline-SMOTE, and ADASYN, to evaluate its performance against theirs.
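    The abstract does not spell out SIMO's internals. One plausible reading, sketched below, is that an SVM flags "informative" minority instances (its minority-class support vectors) and new samples are interpolated between pairs of them. The function name, the RBF kernel, and the interpolation step are illustrative assumptions, not the paper's actual procedure.

        import numpy as np
        from sklearn.svm import SVC

        def simo_like_oversample(X, y, minority_label, n_new, seed=0):
            """Hedged SIMO-like sketch: interpolate new samples between the
            minority support vectors of an SVM fitted on (X, y) (numpy arrays).
            Assumes at least two minority support vectors exist."""
            rng = np.random.default_rng(seed)
            svm = SVC(kernel="rbf").fit(X, y)
            mask = y[svm.support_] == minority_label
            informative = X[svm.support_[mask]]  # minority support vectors
            synth = []
            for _ in range(n_new):
                i, j = rng.choice(len(informative), size=2, replace=False)
                lam = rng.random()  # interpolation factor in [0, 1)
                synth.append(informative[i] + lam * (informative[j] - informative[i]))
            return np.vstack([X] + synth), np.append(y, [minority_label] * n_new)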

    In-Depth Performance Analysis of SMOTE-Based Oversampling Algorithms in Binary Classification

    In the field of machine learning, the problem of class imbalance considerably impairs the performance of classification algorithms. Various techniques have been proposed that seek to mitigate classifier bias with respect to the majority class, with simple oversampling approaches being among the most effective. Their main representative is the well-known SMOTE algorithm, which introduces a synthetic instance creation mechanism as an interpolation procedure between minority instances. To date, an abundance of SMOTE-based extensions intended to improve the original algorithm have been proposed. This paper aims to compare the performance of several such extensions. In addition to comparing overall performance, the impact of the selected oversamplers on per-class performance is also evaluated. Finally, this paper tries to interpret the obtained performance results with respect to the internal procedures of the oversampling algorithms. Some interesting findings have been made in this regard.
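    The interpolation mechanism that all these extensions build on fits in a few lines. Below is a minimal sketch of the original SMOTE step (not of any of the compared extensions), using scikit-learn for the neighbour search.

        import numpy as np
        from sklearn.neighbors import NearestNeighbors

        def smote_interpolate(X_min, n_new, k=5, seed=0):
            """Each synthetic point lies on the segment between a minority
            instance and one of its k nearest minority neighbours."""
            rng = np.random.default_rng(seed)
            nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
            _, idx = nn.kneighbors(X_min)           # idx[:, 0] is the point itself
            synth = np.empty((n_new, X_min.shape[1]))
            for s in range(n_new):
                i = rng.integers(len(X_min))        # random minority instance
                j = idx[i, rng.integers(1, k + 1)]  # one of its k neighbours
                synth[s] = X_min[i] + rng.random() * (X_min[j] - X_min[i])
            return synth

    The extensions compared in the paper differ mainly in which minority instances and neighbours this loop is allowed to pick.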

    Learning From Major Accidents: A Meta-Learning Perspective

    Learning from the past is essential to improve safety and reliability in the chemical industry. In the context of Industry 4.0 and Industry 5.0, where Artificial Intelligence and IoT are expanding throughout every industrial sector, it is essential to determine if an artificial learner may exploit historical accident data to support a more efficient and sustainable learning framework. One important limitation of Machine Learning algorithms is their difficulty in generalizing over multiple tasks. In this context, the present study aims to investigate the issue of meta-learning and transfer learning, evaluating whether the knowledge extracted from a generic accident database could be used to predict the consequence of new, technology-specific accidents. To this end, a classification algorithm is trained on a large and generic accident database to learn the relationship between accident features and consequence severity from a diverse pool of examples. Later, the acquired knowledge is transferred to another domain to predict the number of fatalities and injuries in new accidents. The methodology is evaluated on a test case, where two classification algorithms are trained on a generic accident database (i.e., the Major Hazard Incident Data Service) and evaluated on a technology-specific, lower-quality database. The results suggest that automated algorithms can learn from historical data and transfer knowledge to predict the severity of different types of accidents. The findings indicate that the knowledge gained from previous tasks might be used to address new tasks. Therefore, the proposed approach reduces the need for new data and the cost of the analyses.
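    In its simplest form, the transfer step described above is "fit on the generic database, predict on the technology-specific one". The sketch below assumes hypothetical file names, column names, and a gradient-boosting classifier; it illustrates the idea rather than reproducing the study's actual pipeline.

        import pandas as pd
        from sklearn.compose import make_column_transformer
        from sklearn.ensemble import GradientBoostingClassifier
        from sklearn.metrics import classification_report
        from sklearn.pipeline import make_pipeline
        from sklearn.preprocessing import OneHotEncoder

        # Hypothetical files and columns, for illustration only.
        source = pd.read_csv("generic_accidents.csv")        # e.g. an MHIDAS export
        target = pd.read_csv("tech_specific_accidents.csv")
        cat_cols, num_cols = ["substance", "event_type"], ["quantity_released"]

        pipe = make_pipeline(
            make_column_transformer(
                (OneHotEncoder(handle_unknown="ignore"), cat_cols),
                remainder="passthrough"),
            GradientBoostingClassifier())
        pipe.fit(source[cat_cols + num_cols], source["severity_class"])

        # Transfer in its simplest form: the model trained on the generic
        # database is applied unchanged to the technology-specific domain.
        pred = pipe.predict(target[cat_cols + num_cols])
        print(classification_report(target["severity_class"], pred))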

    Learning from Major Accidents: a Machine Learning Approach

    Learning from past mistakes is crucial to prevent the reoccurrence of accidents involving dangerous substances. Nevertheless, historical accident data are rarely used by the industry, and their full potential is largely unexpressed. In this setting, this study set out to take advantage of improvements in data science and Machine Learning to exploit accident data and build a predictive model for severity prediction. The proposed method makes use of classification algorithms to map the features of an accident to the corresponding severity category (i.e., the number of people that are killed and injured). Data extracted from existing databases is used to train the model. The method has been applied to a case study, where three classification models (Wide, Deep Neural Network, and Wide&Deep) have been trained and evaluated on the Major Hazard Incident Data Service (MHIDAS) database. The results indicate that the Wide&Deep model offers the best performance.
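    As a rough illustration of the best-performing architecture, the Keras sketch below combines a linear ("wide") path with an MLP ("deep") path by summing their logits, following the general Wide&Deep idea; the layer sizes and input dimension are assumptions, not the paper's configuration.

        import tensorflow as tf

        n_features, n_classes = 40, 3  # assumed shapes, for illustration
        inputs = tf.keras.Input(shape=(n_features,))

        wide = tf.keras.layers.Dense(n_classes)(inputs)              # linear path
        deep = tf.keras.layers.Dense(64, activation="relu")(inputs)  # deep path
        deep = tf.keras.layers.Dense(32, activation="relu")(deep)
        deep = tf.keras.layers.Dense(n_classes)(deep)

        logits = tf.keras.layers.Add()([wide, deep])  # sum wide and deep logits
        model = tf.keras.Model(inputs, tf.keras.layers.Softmax()(logits))
        model.compile(optimizer="adam",
                      loss="sparse_categorical_crossentropy",
                      metrics=["accuracy"])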

    Classification of rare land cover types: Distinguishing annual and perennial crops in an agricultural catchment in South Korea

    Many environmental data are inherently imbalanced, with some majority land use and land cover types dominating over rare ones. In cultivated ecosystems, minority classes are often the target, as they might indicate a beginning land use change. Most standard classifiers perform best on a balanced distribution of classes and fail to detect minority classes. We used the synthetic minority oversampling technique (SMOTE) with Random Forest to classify land cover classes in a small agricultural catchment in South Korea using MODIS time series. This area faces a major soil erosion problem, and policy measures encourage farmers to replace annual with perennial crops to mitigate this issue. Our major goal was therefore to improve the classification performance on annual and perennial crops. We compared four different classification scenarios on the original imbalanced and the synthetically oversampled balanced data to quantify the effect of SMOTE on classification performance. SMOTE substantially increased the true positive rate of all oversampled minority classes. However, the performance on minority classes remained lower than on the majority class. We attribute this result to a class overlap already present in the original dataset that is not resolved by SMOTE. Our results show that resampling algorithms can help to derive more accurate land use and land cover maps from freely available data. These maps can be used to provide information on the distribution of land use classes in heterogeneous agricultural areas and could potentially benefit decision making.
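    The SMOTE-plus-Random-Forest combination used here is available off the shelf; below is a minimal sketch with imbalanced-learn and scikit-learn, where synthetic data stands in for the MODIS time-series features.

        from imblearn.over_sampling import SMOTE
        from imblearn.pipeline import make_pipeline
        from sklearn.datasets import make_classification
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.model_selection import cross_val_score

        # Synthetic stand-in for the MODIS features: four classes with
        # one dominant majority class and three minority classes.
        X, y = make_classification(n_samples=600, n_classes=4, n_informative=8,
                                   weights=[0.7, 0.15, 0.1, 0.05], random_state=0)

        # The imblearn pipeline applies SMOTE only inside each training fold,
        # so validation folds keep the original class distribution.
        pipe = make_pipeline(SMOTE(random_state=0),
                             RandomForestClassifier(n_estimators=500, random_state=0))
        print(cross_val_score(pipe, X, y, cv=5, scoring="balanced_accuracy").mean())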

    Data exploration by using the monotonicity property

    Dealing with different misclassification costs has been a persistent problem in classification. Some algorithms, like most rule induction methods, can predict quite accurately when the misclassification costs for each class are assumed equal. However, when the misclassification costs change, which is a common phenomenon in reality, these algorithms are not capable of adjusting their results. Other algorithms, like the Bayesian methods, can yield the probabilities of an unclassified example belonging to the given classes, which helps to adjust the results according to different misclassification costs. The shortcoming of such algorithms is that, when the misclassification costs for each class are the same, they do not generate the most accurate results. This thesis attempts to incorporate the merits of both kinds of algorithms into one; that is, to develop a new algorithm which can predict relatively accurately and can adjust to changes in misclassification costs. The strategy of the new algorithm is to create a weighted voting system. A weighted voting system evaluates the evidence of a new example belonging to each class, calculates an assessment of the probabilities for the example, and assigns the example to a class according to those probabilities as well as the misclassification costs. The main problem in creating a weighted voting system is deciding the optimal weights of the individual votes. To solve this problem, we mainly rely on the monotonicity property. The monotonicity property has been found to exist not only in pure monotone systems but also in non-monotone systems. Since the study of the monotonicity property has been a huge success on monotone systems, it is natural to apply it to non-monotone systems as well. This thesis deals only with binary systems. Though such systems hardly exist in practice, this treatment provides concrete ideas for the development of general solution algorithms. After the final algorithm was formulated, it was tested on a wide range of randomly generated synthetic datasets and compared with other existing classifiers. The results indicate that the algorithm performs both effectively and efficiently.
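    The final assignment step the abstract describes (combining the voting system's class probabilities with the misclassification costs) amounts to picking the class with minimum expected cost; the weight-optimization via the monotonicity property is the thesis's own contribution and is not reproduced here. A minimal sketch of the decision rule:

        import numpy as np

        def cost_sensitive_assign(probs, cost_matrix):
            """Assign each example to the class with minimum expected cost.

            probs: (n_examples, n_classes) class-membership probabilities,
                   e.g. produced by the weighted voting system.
            cost_matrix[i, j]: cost of predicting class j when the truth is i.
            """
            expected = probs @ cost_matrix  # (n_examples, n_classes)
            return expected.argmin(axis=1)

        # Binary example: a false negative (predicting 0 when the truth is 1)
        # costs five times a false positive, so even a 0.75 probability of
        # class 0 is not enough to predict class 0.
        probs = np.array([[0.75, 0.25]])
        costs = np.array([[0.0, 1.0],   # truth 0: false positive costs 1
                          [5.0, 0.0]])  # truth 1: false negative costs 5
        print(cost_sensitive_assign(probs, costs))  # -> [1]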