99 research outputs found
An enhanced resampling technique for imbalanced data sets
A data set is considered imbalanced if the instances of one class (the majority class) outnumber those of the other class (the minority class). The main problem with binary imbalanced data sets is that classifiers tend to ignore the minority class. Numerous resampling techniques, such as undersampling, oversampling, and combinations of the two, have been widely used. However, undersampling and oversampling techniques suffer from the elimination and addition of relevant data, which may lead to poor classification results. Hence, this study aims to improve classification metrics by enhancing the undersampling technique and combining it with an existing oversampling technique. To achieve this objective, Fuzzy Distance-based Undersampling (FDUS) is proposed. Entropy estimation is used to produce fuzzy thresholds that categorise the instances of the majority and minority classes into membership functions. FDUS is then combined with the Synthetic Minority Oversampling Technique (SMOTE); the combination, known as FDUS+SMOTE, is executed in sequence until a balanced data set is achieved. FDUS and FDUS+SMOTE are compared with four techniques based on classification accuracy, F-measure and G-mean. From the results, FDUS achieved better classification accuracy, F-measure and G-mean than the other techniques, with averages of 80.57%, 0.85 and 0.78, respectively. This shows that fuzzy logic, when incorporated into the distance-based undersampling technique, is able to reduce the elimination of relevant data. Further, the findings show that FDUS+SMOTE performed better than the combinations of SMOTE with Tomek Links and SMOTE with Edited Nearest Neighbour on benchmark data sets. FDUS+SMOTE minimised the removal of relevant data from the majority class and avoided overfitting. On average, FDUS and FDUS+SMOTE were able to balance categorical, integer and real data sets and enhanced the performance of binary classification. Furthermore, the techniques performed well on small data sets with approximately 100 to 800 instances.
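The distance-based undersampling idea can be illustrated with a minimal sketch. This is not the authors' FDUS (which additionally derives fuzzy thresholds from entropy estimation); it is one simple variant that keeps the majority instances farthest from the minority-class centroid until the classes are balanced. The function name and toy data are hypothetical.

```python
from math import dist  # Python 3.8+: Euclidean distance between two points

def distance_based_undersample(majority, minority):
    """Illustrative distance-based undersampling: keep the majority
    instances farthest from the minority-class centroid, dropping the
    closest ones until both classes have the same number of instances."""
    # Centroid of the minority class, feature by feature.
    n_features = len(minority[0])
    centroid = [sum(x[i] for x in minority) / len(minority)
                for i in range(n_features)]
    # Rank majority instances by distance to the centroid, farthest first.
    ranked = sorted(majority, key=lambda x: dist(x, centroid), reverse=True)
    return ranked[:len(minority)]

majority = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (6.0, 6.0)]
minority = [(5.5, 5.5), (6.5, 6.5)]
kept = distance_based_undersample(majority, minority)
# kept == [(0.0, 0.0), (1.0, 1.0)] — the two points nearest the minority
# centroid (6.0, 6.0) were removed, balancing the two classes at 2 each.
```

FDUS replaces this crisp distance ranking with fuzzy membership functions, which is what reduces the elimination of relevant majority instances.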
SMOTE for Learning from Imbalanced Data: Progress and Challenges, Marking the 15-year Anniversary
The Synthetic Minority Oversampling Technique (SMOTE) preprocessing algorithm is considered the "de facto" standard in the framework of learning from imbalanced data. This is due to its simplicity in the design of the procedure, as well as its robustness when applied to different types of problems. Since its publication in 2002, SMOTE has proven successful in a variety of applications from several different domains. SMOTE has also inspired several approaches to counter the issue of class imbalance, and has significantly contributed to new supervised learning paradigms, including multilabel classification, incremental learning, semi-supervised learning and multi-instance learning, among others. It is a standard benchmark for learning from imbalanced data. It is also featured in a number of different software packages, from open source to commercial. In this paper, marking the fifteen-year anniversary of SMOTE, we reflect on the SMOTE journey, discuss the current state of affairs with SMOTE and its applications, and identify the next set of challenges to extend SMOTE for Big Data problems. This work has been partially supported by the Spanish Ministry of Science and Technology under projects TIN2014-57251-P, TIN2015-68454-R and TIN2017-89517-P; the Project BigDaP-TOOLS - Ayudas Fundación BBVA a Equipos de Investigación Científica 2016; and the National Science Foundation (NSF) Grant IIS-1447795.
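The core step of the 2002 SMOTE algorithm — interpolating a synthetic minority point between a minority instance and one of its k nearest minority neighbours — can be sketched in plain Python. This is a simplified illustration, not a full reimplementation; the function name, parameters and toy data are ours.

```python
import random

def smote_sample(minority, k=2, n_new=4, seed=42):
    """Simplified SMOTE core: pick a random minority instance, pick one
    of its k nearest minority neighbours, and create a synthetic point
    at a random position on the line segment between the two."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest minority neighbours of x (excluding x itself),
        # ranked by squared Euclidean distance.
        neighbours = sorted(
            (p for p in minority if p != x),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)))[:k]
        nn = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nn)))
    return synthetic

minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_sample(minority)
# Each synthetic point is a convex combination of two minority points,
# so it lies inside the region spanned by the minority class.
```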
Predictive Framework for Imbalance Dataset
The purpose of this research is to propose a new predictive maintenance framework which can be used to generate a prediction model for the deterioration of process materials. Real yield data obtained from Fuji Electric Malaysia were used in this research, and existing data pre-processing and classification methodologies were adapted. Properties of the proposed framework include: developing an approach to correlate material defects; developing an approach to represent data attribute features; analyzing various ratios and types of data resampling; analyzing the impact of data dimension reduction for various data sizes; and partitioning data size and algorithmic schemes against prediction performance. Experimental results suggested that the class probability distribution function of a prediction model has to be close to that of the training dataset; a less skewed environment enables learning schemes to discover a better function F in a bigger function space F_all within a higher-dimensional feature space; and data sampling and partition size appear to proportionally improve precision and recall if class distribution ratios are balanced. A comparative study was also conducted and showed that the proposed approaches performed better. This research was conducted on a limited number of datasets, test sets and variables; thus, the obtained results are applicable only to the study domain with the selected datasets. This research has introduced a new predictive maintenance framework which can be used in manufacturing industries to generate a prediction model based on the deterioration of process materials. Consequently, this may allow manufacturers to conduct predictive maintenance not only for equipment but also for process materials. The major contribution of this research is a step-by-step guideline consisting of methods and approaches for generating a prediction model for process materials.
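The precision and recall metrics cited above are computed from true positives, false positives and false negatives; a minimal sketch (labels here are hypothetical, for illustration only):

```python
def precision_recall(y_true, y_pred, positive=1):
    """Compute precision and recall for the given positive class label."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

p, r = precision_recall([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
# tp = 2, fp = 1, fn = 1  →  precision = 2/3, recall = 2/3
```

On skewed class distributions these two metrics can diverge sharply, which is why the framework analyzes resampling ratios rather than relying on accuracy alone.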
DETECTION OF FINANCIAL INFORMATION MANIPULATION BY AN ENSEMBLE-BASED MECHANISM
Complicated financial information manipulation, involving heightened offender knowledge of transactional procedures, can be damaging to the reputations of corporations and their auditors, as well as cause serious turbulence in financial markets. Unfortunately, most incidents of financial information manipulation involve higher-level managers who are truly knowledgeable and comprehend the limitations of standard auditing procedures. Thus, there is an urgent need for additional detection mechanisms to prevent financial information manipulation. To address this problem, the author proposes an ensemble-based mechanism (EM) consisting of a feature selection and extraction ensemble and an extreme learning machine (ELM). The model not only counters the redundancy-removing problem, but also gives direction to auditors who need to allocate limited audit resources to abnormal client relationships during the auditing procedure and protect the CPA firms' reputation. The experimental results demonstrate that the model is a promising alternative for detecting financial information manipulation, and one that can ensure both the confidence of investors and the stability of financial markets
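An extreme learning machine of the kind used in the proposed mechanism can be sketched as a random hidden layer with a least-squares readout. This is a generic, minimal single-output ELM (the paper's feature selection and extraction ensemble is not reproduced); all names, parameters and toy data are illustrative.

```python
import math
import random

def elm_train(X, y, n_hidden=8, seed=0, ridge=1e-6):
    """Minimal ELM: fixed random hidden weights, tanh activation, and a
    ridge-regularised least-squares readout solved via the normal
    equations (H^T H + ridge*I) beta = H^T y with Gaussian elimination."""
    rng = random.Random(seed)
    d = len(X[0])
    W = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(n_hidden)]
    b = [rng.uniform(-1, 1) for _ in range(n_hidden)]

    def hidden(x):
        # Random-projection hidden layer with tanh activation.
        return [math.tanh(sum(w_i * x_i for w_i, x_i in zip(w, x)) + bi)
                for w, bi in zip(W, b)]

    H = [hidden(x) for x in X]
    A = [[sum(H[r][i] * H[r][j] for r in range(len(H)))
          + (ridge if i == j else 0.0) for j in range(n_hidden)]
         for i in range(n_hidden)]
    rhs = [sum(H[r][i] * y[r] for r in range(len(H))) for i in range(n_hidden)]
    # Gaussian elimination with partial pivoting, then back-substitution.
    for col in range(n_hidden):
        piv = max(range(col, n_hidden), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        rhs[col], rhs[piv] = rhs[piv], rhs[col]
        for r in range(col + 1, n_hidden):
            f = A[r][col] / A[col][col]
            for c in range(col, n_hidden):
                A[r][c] -= f * A[col][c]
            rhs[r] -= f * rhs[col]
    beta = [0.0] * n_hidden
    for i in reversed(range(n_hidden)):
        beta[i] = (rhs[i] - sum(A[i][j] * beta[j]
                                for j in range(i + 1, n_hidden))) / A[i][i]
    return lambda x: sum(b_i * h_i for b_i, h_i in zip(beta, hidden(x)))

# Toy binary task (labels +1 / -1); only the readout is trained, which is
# what makes ELM training fast compared with backpropagation.
X = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
y = [-1.0, 1.0, 1.0, -1.0]
predict = elm_train(X, y)
```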
Deep Learning for predictive maintenance
Recently, with the appearance of Industry 4.0 (I4.0), machine learning (ML) within artificial intelligence (AI), the industrial Internet of Things (IIoT) and cyber-physical systems (CPS) have accelerated the development of data-oriented applications such as predictive maintenance (PdM). PdM applied to asset-dependent industries has led to operational cost savings, productivity improvements and enhanced safety management capabilities. In addition, predictive maintenance strategies provide useful information concerning the source of a failure or malfunction, reducing unnecessary maintenance operations.
The concept of prognostics and health management (PHM) has emerged as a predictive maintenance process. PHM has become an unavoidable tendency in smart manufacturing, offering a reliable solution for handling the health status of industrial equipment. The latter requires efficient and effective system health monitoring methods, including processing and analysing massive machinery data to detect anomalies and perform diagnosis and prognosis. Prognostics is considered a key PHM process, with capabilities for predicting future states, mainly by predicting the residual lifetime during which a machine can perform its intended function, i.e., estimating the remaining useful life (RUL) of a system. The prognostic research domain is far from mature; it is still new, which explains the various challenges that must be addressed. Therefore, the work presented in this thesis focuses mainly on the prognostics of monitored machinery from an RUL estimation point of view, using Deep Learning (DL) algorithms. Capitalising on the recent success of DL, this dissertation introduces methods and algorithms dedicated to predictive maintenance. We focused on improving the performance of aero-engine prognostics, particularly on estimating an accurate RUL, using ensemble learning and deep learning. To this end, two contributions are proposed, and the results obtained were validated by an extensive comparative analysis using the public C-MAPSS turbofan engine benchmark datasets. In the first contribution, for RUL prediction, we propose two hybrid methods based on promising DL architectures, leveraging the power of multimodal and hybrid deep neural networks to capture information at different time intervals and ultimately achieve more accurate RUL predictions.
The proposed end-to-end deep architectures jointly optimise the feature reduction and RUL prediction steps in a hierarchical manner, aiming to achieve a data representation of low dimensionality and minimal variable redundancy while preserving critical asset degradation information with minimal preprocessing effort. The second contribution addresses the fact that, in practical situations, RUL is usually affected by uncertainty. We therefore propose an innovative RUL estimation strategy that assesses the health status of degrading machinery (providing the probabilities of system failure in different time windows) and predicts the RUL window.
Keywords: Prognostics and Health Management (PHM), Remaining useful life (RUL),
Predictive Maintenance (PdM), C-MAPSS dataset, Ensemble learning, Deep learning
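A common labelling convention for RUL estimation on run-to-failure data such as C-MAPSS — though not necessarily the exact scheme used in this thesis — is a piecewise-linear target that caps early-life RUL at a constant, since degradation is negligible at first:

```python
def piecewise_rul(n_cycles, max_rul=125):
    """Piecewise-linear RUL target for one run-to-failure trajectory:
    capped at max_rul early in life, then decreasing linearly to 0 at
    the final observed cycle."""
    return [min(max_rul, n_cycles - 1 - t) for t in range(n_cycles)]

labels = piecewise_rul(200, max_rul=125)
# labels[0] == 125 (capped early life), labels[-1] == 0 (failure)
```

The cap value is a tuning choice; values around 120-130 cycles are typical in the C-MAPSS literature.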
Enhanced default risk models with SVM+
Default risk models have lately raised great interest due to the recent world economic crisis. In spite of the many advanced techniques that have been extensively proposed, no comprehensive method incorporating a holistic perspective has hitherto been considered. Thus, the existing models for bankruptcy prediction lack full coverage of contextual knowledge, which may prevent decision makers such as investors and financial analysts from making the right decisions. Recently, SVM+ has provided a formal way to incorporate additional information (not only training data) into the learning model, improving generalization. In financial settings, examples of such non-financial (though relevant) information are marketing reports, the competitor landscape, the economic environment, customer screening, industry trends, etc. By exploiting additional information to improve classical inductive learning, we propose a prediction model where the data are naturally separated into several structured groups clustered by the size and annual turnover of the firms. Experimental results on a heterogeneous data set of French companies demonstrated that the proposed default risk model showed better predictive performance than the baseline SVM and multi-task learning with SVM.
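The separation of firms into structured groups by size and annual turnover can be illustrated with a short sketch; the thresholds, field names and toy records are hypothetical, and the SVM+ training itself is not shown:

```python
from collections import defaultdict

def group_firms(firms, size_cut=250, turnover_cut=50e6):
    """Partition firms into structured groups by headcount and annual
    turnover; each group would then feed a group-aware learner such as
    SVM+ or multi-task SVM. Thresholds are illustrative only."""
    groups = defaultdict(list)
    for firm in firms:
        size = "large" if firm["employees"] >= size_cut else "small"
        turnover = "high" if firm["turnover"] >= turnover_cut else "low"
        groups[(size, turnover)].append(firm)
    return groups

firms = [{"employees": 300, "turnover": 80e6},
         {"employees": 40, "turnover": 2e6}]
g = group_firms(firms)
# g has two groups: ('large', 'high') and ('small', 'low')
```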
Unsupervised Text Topic-Related Gene Extraction for Large Unbalanced Datasets
Traditional unsupervised feature extraction algorithms commonly assume that the distribution of the different clusters in a dataset is balanced. However, feature selection is guided by the calculation of similarities among features when topic keywords are extracted from large numbers of unmarked, unbalanced text datasets. As a result, the selected features cannot truly reflect the information of the original data set, which affects the subsequent performance of classifiers. To solve this problem, a new method of extracting unsupervised text topic-related genes is proposed in this paper. Firstly, a group of sample clusters is obtained by factor analysis and a density peak algorithm, based on which the dataset is labelled. Then, considering the influence of the unbalanced distribution of sample clusters on feature selection, a CHI statistical matrix feature selection method, which combines average local density and information entropy, is used to strengthen the features of low-density, small-sample clusters. Finally, a related gene extraction method based on the exploration of high-order relevance in multidimensional statistical data is described, which uses independent component analysis to enhance the generalisability of the selected features. In this way, unsupervised text topic-related genes can be extracted from large unbalanced datasets. Experimental results suggest that the proposed method outperforms existing methods in extracting text subject terms from low-density, small-sample clusters, and has higher purity and feature dimension-reduction ability
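The CHI statistic underlying the feature selection step can be computed per (term, category) pair from a 2x2 contingency table. This is the standard chi-square formulation for text feature selection, not the paper's density- and entropy-weighted variant; the example counts are hypothetical.

```python
def chi_square(n_11, n_10, n_01, n_00):
    """Chi-square score for one (term, category) pair.
    n_11: docs in the category containing the term
    n_10: docs outside the category containing the term
    n_01: docs in the category without the term
    n_00: docs outside the category without the term"""
    n = n_11 + n_10 + n_01 + n_00
    num = n * (n_11 * n_00 - n_10 * n_01) ** 2
    den = (n_11 + n_01) * (n_10 + n_00) * (n_11 + n_10) * (n_01 + n_00)
    return num / den if den else 0.0

# A term that appears in 20 of 30 in-category docs but only 5 of 70
# out-of-category docs scores highly, i.e. it is topic-discriminative.
score = chi_square(20, 5, 10, 65)
# score ≈ 39.68
```

The paper's contribution is to re-weight such scores by average local density and information entropy so that terms from low-density, small-sample clusters are not drowned out.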
On the relevance of preprocessing in predictive maintenance for dynamic systems
The complexity involved in real-time data-driven monitoring of dynamic systems for predictive maintenance is usually huge. To a greater or lesser extent, any data-driven approach is sensitive to data preprocessing, understood as any data treatment prior to the application of the monitoring model, which is sometimes crucial for the final development of the employed monitoring technique. The aim of this work is to quantify, in an exhaustive way, the sensitivity of data-driven predictive maintenance models in dynamic systems.
We consider a couple of predictive maintenance scenarios, each of them defined by publicly available data. For each scenario, we consider its properties and apply several techniques for each of the successive preprocessing steps, e.g. data cleaning, missing-value treatment, outlier detection, feature selection, or imbalance compensation. The pretreatment configurations, i.e. sequential combinations of techniques from different preprocessing steps, are considered together with different monitoring approaches, in order to determine the relevance of data preprocessing for predictive maintenance in dynamical systems
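The space of pretreatment configurations — one technique choice per sequential preprocessing step — can be enumerated with a Cartesian product; the step names and options below are illustrative, not the paper's exact catalogue:

```python
from itertools import product

# Hypothetical options for each preprocessing step; "none" skips a step.
steps = {
    "missing_values": ["drop", "mean_impute"],
    "outliers": ["none", "iqr_filter"],
    "feature_selection": ["none", "variance_threshold"],
    "imbalance": ["none", "undersample", "oversample"],
}

# Every sequential pretreatment configuration is one combination of
# choices, applied in the fixed step order above.
configurations = [dict(zip(steps, choice)) for choice in product(*steps.values())]
# 2 * 2 * 2 * 3 = 24 configurations to cross with each monitoring model
```

Crossing every configuration with every monitoring approach is what makes an exhaustive sensitivity study expensive, and why quantifying which steps actually matter is useful.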