
    An enhanced resampling technique for imbalanced data sets

    A data set is considered imbalanced if the instances of one class (the majority class) outnumber those of the other class (the minority class). The main problem with binary imbalanced data sets is that classifiers tend to ignore the minority class. Numerous resampling techniques, such as undersampling, oversampling, and combinations of the two, have been widely used. However, undersampling and oversampling suffer from the elimination and addition of relevant data, which may lead to poor classification results. Hence, this study aims to improve classification metrics by enhancing the undersampling technique and combining it with an existing oversampling technique. To achieve this objective, a Fuzzy Distance-based Undersampling (FDUS) technique is proposed. Entropy estimation is used to produce fuzzy thresholds that categorise the instances of the majority and minority classes into membership functions. FDUS is then combined with the Synthetic Minority Oversampling TEchnique (SMOTE), a combination known as FDUS+SMOTE, which is executed in sequence until a balanced data set is achieved. FDUS and FDUS+SMOTE were compared with four techniques based on classification accuracy, F-measure and G-mean. FDUS achieved better classification accuracy, F-measure and G-mean than the other techniques, with averages of 80.57%, 0.85 and 0.78, respectively. This showed that fuzzy logic, when incorporated with distance-based undersampling, was able to reduce the elimination of relevant data. Further, the findings showed that FDUS+SMOTE performed better than the combinations of SMOTE with Tomek Links and SMOTE with Edited Nearest Neighbour on benchmark data sets. FDUS+SMOTE minimised the removal of relevant data from the majority class and avoided overfitting. On average, FDUS and FDUS+SMOTE were able to balance categorical, integer and real data sets and enhanced the performance of binary classification. Furthermore, the techniques performed well on small data sets with approximately 100 to 800 instances.
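
    The abstract does not spell out FDUS's fuzzy entropy-based thresholds, but the overall "undersample, then oversample" sequence can be sketched. Below, a plain distance-based rule is a hypothetical stand-in for FDUS, and SMOTE comes from the imbalanced-learn package; only the sequencing mirrors the description above.

```python
# A minimal sketch of the undersample-then-oversample sequence described
# above. The distance rule here is a simplified stand-in for FDUS's fuzzy
# entropy-based membership thresholds, which are not reproduced.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import pairwise_distances
from imblearn.over_sampling import SMOTE

def distance_based_undersample(X, y, majority_label=0, keep_ratio=0.7):
    """Keep the majority instances farthest from the minority class,
    discarding borderline ones (a stand-in for the fuzzy rule)."""
    maj = np.where(y == majority_label)[0]
    mino = np.where(y != majority_label)[0]
    # Distance from each majority instance to its nearest minority instance.
    d = pairwise_distances(X[maj], X[mino]).min(axis=1)
    keep = maj[np.argsort(d)[-int(keep_ratio * len(maj)):]]
    idx = np.concatenate([keep, mino])
    return X[idx], y[idx]

# Toy imbalanced data: undersample the majority first, then let SMOTE
# synthesise minority instances until the classes are balanced.
X, y = make_classification(n_samples=600, weights=[0.9, 0.1], random_state=0)
X_us, y_us = distance_based_undersample(X, y)
X_bal, y_bal = SMOTE(random_state=42).fit_resample(X_us, y_us)
```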

    SMOTE for Learning from Imbalanced Data: Progress and Challenges, Marking the 15-year Anniversary

    The Synthetic Minority Oversampling Technique (SMOTE) preprocessing algorithm is considered a "de facto" standard in the framework of learning from imbalanced data. This is due to its simplicity in the design of the procedure, as well as its robustness when applied to different types of problems. Since its publication in 2002, SMOTE has proven successful in a variety of applications from several different domains. SMOTE has also inspired several approaches to counter the issue of class imbalance, and has significantly contributed to new supervised learning paradigms, including multi-label classification, incremental learning, semi-supervised learning and multi-instance learning, among others. It is a standard benchmark for learning from imbalanced data, and it is featured in a number of different software packages, from open source to commercial. In this paper, marking the fifteen-year anniversary of SMOTE, we reflect on the SMOTE journey, discuss the current state of affairs with SMOTE and its applications, and identify the next set of challenges to extend SMOTE for Big Data problems. This work has been partially supported by the Spanish Ministry of Science and Technology under projects TIN2014-57251-P, TIN2015-68454-R and TIN2017-89517-P; the Project 887 BigDaP-TOOLS - Ayudas Fundación BBVA a Equipos de Investigación Científica 2016; and the National Science Foundation (NSF) Grant IIS-1447795.
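
    As a concrete reference point, the core interpolation step of the 2002 paper fits in a few lines: pick a minority sample, pick one of its k nearest minority neighbours, and place a synthetic point at a random position on the segment between them. The sketch below implements just that step; library implementations add a class wrapper, ratio handling and edge cases.

```python
# The core SMOTE step: interpolate synthetic minority points between a
# minority sample and one of its k nearest minority neighbours.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sample(X_min, k=5, n_new=100, rng=None):
    rng = np.random.default_rng(rng)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    # The first neighbour is the point itself, so keep columns 1..k.
    neigh = nn.kneighbors(X_min, return_distance=False)[:, 1:]
    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        j = rng.integers(len(X_min))         # a random minority sample
        nb = X_min[rng.choice(neigh[j])]     # one of its k neighbours
        gap = rng.random()                   # random position on the segment
        synthetic[i] = X_min[j] + gap * (nb - X_min[j])
    return synthetic

X_min = np.random.default_rng(0).normal(size=(40, 2))  # toy minority class
X_syn = smote_sample(X_min, k=5, n_new=60, rng=1)
```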

    Predictive Framework for Imbalance Dataset

    The purpose of this research is to propose a new predictive maintenance framework that can be used to generate a prediction model for the deterioration of process materials. Real yield data obtained from Fuji Electric Malaysia was used in this research, and existing data pre-processing and classification methodologies were adapted. Properties of the proposed framework include: developing an approach to correlate material defects, developing an approach to represent data attribute features, analysing various ratios and types of data resampling, analysing the impact of data dimension reduction for various data sizes, and partitioning data size and algorithmic schemes against prediction performance. Experimental results suggested that the class probability distribution function of a prediction model has to be close to that of the training dataset; a less skewed environment enables learning schemes to discover a better function F in a larger function space within a higher-dimensional feature space, and data sampling and partition size appear to proportionally improve precision and recall when class distribution ratios are balanced. A comparative study was also conducted and showed that the proposed approaches performed better. This research was conducted on a limited number of datasets, test sets and variables; thus, the obtained results are applicable only to the study domain with the selected datasets. This research has introduced a new predictive maintenance framework which can be used in manufacturing industries to generate a prediction model based on the deterioration of process materials. Consequently, this may allow manufacturers to conduct predictive maintenance not only for equipment but also for process materials. The major contribution of this research is a step-by-step guideline consisting of methods/approaches for generating a prediction model for process materials.
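
    One element of such a framework, sweeping resampling ratios and recording precision and recall, can be sketched as follows. The Fuji Electric yield data is not public, so a synthetic imbalanced set stands in for it, and the ratios and classifier chosen are illustrative only.

```python
# Sketch: how precision/recall respond to the minority/majority ratio
# after resampling. Synthetic data stands in for the (private) yield data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score
from imblearn.over_sampling import SMOTE

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

for ratio in (0.25, 0.5, 1.0):  # target minority/majority ratio
    X_rs, y_rs = SMOTE(sampling_strategy=ratio, random_state=0).fit_resample(X_tr, y_tr)
    clf = RandomForestClassifier(random_state=0).fit(X_rs, y_rs)
    y_hat = clf.predict(X_te)
    print(ratio, precision_score(y_te, y_hat), recall_score(y_te, y_hat))
```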

    DETECTION OF FINANCIAL INFORMATION MANIPULATION BY AN ENSEMBLE-BASED MECHANISM

    Abstract: Complicated financial information manipulation, involving heightened offender knowledge of transactional procedures, can damage the reputations of corporations and their auditors, as well as cause serious turbulence in financial markets. Unfortunately, most incidents of financial information manipulation involve higher-level managers who are truly knowledgeable about, and comprehend the limitations of, standard auditing procedures. Thus, there is an urgent need for additional detection mechanisms to prevent financial information manipulation. To address this problem, the author proposes an ensemble-based mechanism (EM) consisting of a feature selection and extraction ensemble and an extreme learning machine (ELM). The model not only counters the redundancy-removal problem, but also gives direction to auditors who need to allocate limited audit resources to abnormal client relationships during the auditing procedure and protect the CPA firm's reputation. The experimental results demonstrate that the model is a promising alternative for detecting financial information manipulation, one that can ensure both the confidence of investors and the stability of financial markets.
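
    The ELM component is simple enough to sketch: hidden-layer weights are drawn at random and frozen, and only the output weights are solved in closed form by least squares. The feature selection and extraction ensemble from the paper is omitted here; this is a minimal single-model sketch, not the proposed mechanism.

```python
# A minimal extreme learning machine: random, fixed hidden weights and a
# least-squares solve for the output weights.
import numpy as np
from sklearn.datasets import make_classification

class ELM:
    def __init__(self, n_hidden=50, seed=0):
        self.n_hidden = n_hidden
        self.rng = np.random.default_rng(seed)

    def fit(self, X, y):
        self.W = self.rng.normal(size=(X.shape[1], self.n_hidden))
        self.b = self.rng.normal(size=self.n_hidden)
        H = np.tanh(X @ self.W + self.b)                    # random feature map
        self.beta, *_ = np.linalg.lstsq(H, y, rcond=None)   # output weights
        return self

    def predict(self, X):
        return (np.tanh(X @ self.W + self.b) @ self.beta > 0.5).astype(int)

X, y = make_classification(n_samples=400, random_state=0)   # toy data
print(ELM(n_hidden=100).fit(X, y).predict(X[:5]))
```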

    Deep Learning for predictive maintenance

    Recently, with the appearance of Industry 4.0 (I4.0), machine learning (ML) within artificial intelligence (AI), the industrial Internet of things (IIoT) and cyber-physical systems (CPS) have accelerated the development of data-oriented applications such as predictive maintenance (PdM). PdM applied to asset-dependent industries has led to operational cost savings, productivity improvements and enhanced safety management capabilities. In addition, predictive maintenance strategies provide useful information concerning the source of a failure or malfunction, reducing unnecessary maintenance operations. The concept of prognostics and health management (PHM) has emerged as a predictive maintenance process. PHM has become an unavoidable tendency in smart manufacturing, offering a reliable solution for handling the health status of industrial equipment. The latter requires efficient and effective system health monitoring methods, including processing and analysing massive machinery data to detect anomalies and perform diagnosis and prognosis. Prognostics is considered a key PHM process, with capabilities for predicting future states, mainly by predicting the residual lifetime during which a machine can perform its intended function, i.e., estimating the remaining useful life (RUL) of a system. The prognostics research domain is still new and far from mature, which explains the various challenges that must be addressed. Therefore, the work presented in this thesis focuses on the prognostics of monitored machinery from an RUL-estimation point of view using Deep Learning (DL) algorithms. Capitalising on the recent success of DL, this dissertation introduces methods and algorithms dedicated to predictive maintenance. We focused on improving the performance of aero-engine prognostics, particularly on estimating an accurate RUL using ensemble learning and deep learning. To this end, two contributions are proposed, and the results obtained were validated by an extensive comparative analysis using the public C-MAPSS turbofan engine benchmark datasets. In the first contribution, for RUL prediction, we proposed two hybrid methods based on promising DL architectures, leveraging the power of multimodal and hybrid deep neural networks to capture various information at different time intervals and ultimately achieve more accurate RUL predictions. The proposed end-to-end deep architectures jointly optimise the feature reduction and RUL prediction steps in a hierarchical manner, aiming to achieve a data representation of low dimensionality and minimal variable redundancy while preserving critical asset degradation information with minimal preprocessing effort. In the second contribution, since RUL is usually affected by uncertainty in practical situations, we proposed an innovative RUL estimation strategy that assesses the health status of degrading machinery (providing the probabilities of system failure in different time windows) and predicts the RUL window. Keywords: Prognostics and Health Management (PHM), Remaining Useful Life (RUL), Predictive Maintenance (PdM), C-MAPSS dataset, Ensemble learning, Deep learning
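
    As a point of reference, a baseline deep RUL regressor over C-MAPSS-style sensor windows can be sketched in a few lines. This is not the thesis's hybrid/multimodal architecture or its RUL-window strategy; the window length, sensor count and RUL cap below are assumptions drawn from common C-MAPSS setups.

```python
# Baseline sketch only: a single LSTM regressing RUL from fixed-length
# sensor windows. Random tensors stand in for real C-MAPSS windows.
import numpy as np
import tensorflow as tf

WINDOW, N_SENSORS = 30, 14  # assumed, typical C-MAPSS preprocessing choices

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(WINDOW, N_SENSORS)),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1),   # predicted remaining useful life
])
model.compile(optimizer="adam", loss="mse")

# Toy stand-in data; real inputs are sliding windows over engine sensors,
# with RUL targets commonly capped (here at 125 cycles).
X = np.random.rand(256, WINDOW, N_SENSORS).astype("float32")
y = (np.random.rand(256, 1) * 125).astype("float32")
model.fit(X, y, epochs=2, batch_size=32, verbose=0)
```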

    Enhanced default risk models with SVM+

    Default risk models have lately raised great interest due to the recent world economic crisis. In spite of the many advanced techniques that have been extensively proposed, no comprehensive method incorporating a holistic perspective had hitherto been considered. Thus, the existing models for bankruptcy prediction lack full coverage of contextual knowledge, which may prevent decision makers such as investors and financial analysts from making the right decisions. Recently, SVM+ has provided a formal way to incorporate additional information (not only training data) into the learning models, improving generalisation. In financial settings, examples of such non-financial (though relevant) information are marketing reports, the competitive landscape, the economic environment, customer screening, industry trends, etc. By exploiting additional information able to improve classical inductive learning, we propose a prediction model where the data is naturally separated into several structured groups clustered by the size and annual turnover of the firms. Experimental results on a heterogeneous data set of French companies demonstrated that the proposed default risk model showed better predictive performance than the baseline SVM and multi-task learning with SVM.
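
    SVM+ itself has no off-the-shelf scikit-learn implementation, so the sketch below only illustrates the data setup the abstract describes: firms split into structured groups (e.g. by size or turnover bucket), with one SVM per group compared against a single pooled baseline. The grouping variable here is synthetic and hypothetical.

```python
# Illustration of group-structured training versus a pooled baseline.
# This is not SVM+; it only mirrors the paper's grouped-data setting.
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=900, random_state=0)
# Hypothetical group labels, standing in for size/turnover buckets.
groups = np.random.default_rng(0).integers(0, 3, size=len(y))

baseline = SVC().fit(X, y)                  # single model, no group structure
per_group = {g: SVC().fit(X[groups == g], y[groups == g])
             for g in np.unique(groups)}    # one model per structured group
```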

    Unsupervised Text Topic-Related Gene Extraction for Large Unbalanced Datasets

    There is a common notion that traditional unsupervised feature extraction algorithms follow the assumption that the distribution of the different clusters in a dataset is balanced. However, feature selection is guided by the calculation of similarities among features when topic keywords are extracted from large numbers of unmarked, unbalanced text datasets. As a result, the selected features cannot truly reflect the information of the original dataset, which affects the subsequent performance of classifiers. To solve this problem, a new method of extracting unsupervised text topic-related genes is proposed in this paper. Firstly, a sample cluster group is obtained by factor analysis and a density peak algorithm, based on which the dataset is labelled. Then, considering the influence of the unbalanced distribution of sample clusters on feature selection, a CHI statistical matrix feature selection method, which combines average local density and information entropy, is used to strengthen the features of low-density, small-sample clusters. Finally, a related gene extraction method based on the exploration of high-order relevance in multidimensional statistical data is described, which uses independent component analysis to enhance the generalisability of the selected features. In this way, unsupervised text topic-related genes can be extracted from large unbalanced datasets. The experimental results suggest that the proposed method is better than existing methods at extracting text subject terms from low-density, small-sample clusters, and has higher maturity and feature dimension-reduction ability.
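
    Two of the pipeline's building blocks, CHI-statistic feature selection and independent component analysis, are available off the shelf and can be sketched as below; the density-peak clustering and the local-density/entropy weighting of the proposed method are not shown, and the corpus and parameter choices are illustrative.

```python
# Sketch: chi-square feature selection over term counts, followed by ICA.
# fetch_20newsgroups downloads a public corpus on first use.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.decomposition import FastICA

docs = fetch_20newsgroups(subset="train", categories=["sci.med", "sci.space"])
X = CountVectorizer(max_features=2000).fit_transform(docs.data)
# CHI-statistic selection of the 200 most class-informative terms.
X_sel = SelectKBest(chi2, k=200).fit_transform(X, docs.target)
# ICA over the selected features (FastICA needs a dense matrix).
X_ica = FastICA(n_components=20, random_state=0).fit_transform(X_sel.toarray())
```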

    On the relevance of preprocessing in predictive maintenance for dynamic systems

    The complexity involved in real-time, data-driven monitoring of dynamic systems for predictive maintenance is usually huge. To a greater or lesser degree, any data-driven approach is sensitive to data preprocessing, understood as any data treatment prior to the application of the monitoring model, which is sometimes crucial for the final performance of the employed monitoring technique. The aim of this work is to quantify, in an exhaustive way, the sensitivity of data-driven predictive maintenance models for dynamic systems. We consider a couple of predictive maintenance scenarios, each of them defined by publicly available data. For each scenario, we consider its properties and apply several techniques for each of the successive preprocessing steps, e.g. data cleaning, missing-value treatment, outlier detection, feature selection, or imbalance compensation. The pretreatment configurations, i.e. sequential combinations of techniques from the different preprocessing steps, are considered together with different monitoring approaches, in order to determine the relevance of data preprocessing for predictive maintenance in dynamic systems.
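
    The "pretreatment configuration" idea maps naturally onto a pipeline whose preprocessing steps are swapped via a parameter grid, so each sequential combination is scored under the same protocol. A minimal sketch, with illustrative steps and a synthetic dataset standing in for the public scenarios:

```python
# Enumerate preprocessing configurations by swapping pipeline steps in a
# grid search, scoring every combination with the same cross-validation.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, RobustScaler
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
pipe = Pipeline([
    ("impute", SimpleImputer()),
    ("scale", StandardScaler()),
    ("select", SelectKBest(f_classif)),
    ("model", RandomForestClassifier(random_state=0)),
])
grid = {
    "impute__strategy": ["mean", "median"],        # missing-value treatment
    "scale": [StandardScaler(), RobustScaler()],   # cleaning/scaling choice
    "select__k": [5, 10, 20],                      # feature selection
}
best = GridSearchCV(pipe, grid, cv=3).fit(X, y)
print(best.best_params_)
```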