A Survey on Soft Subspace Clustering
Subspace clustering (SC) is a promising clustering technology to identify
clusters based on their associations with subspaces in high dimensional spaces.
SC can be classified into hard subspace clustering (HSC) and soft subspace
clustering (SSC). While HSC algorithms have been extensively studied and well
accepted by the scientific community, SSC algorithms are relatively new but
gaining more attention in recent years due to better adaptability. In this
paper, a comprehensive survey of existing SSC algorithms and recent
developments is presented. The SSC algorithms are classified systematically
into three main categories, namely, conventional SSC (CSSC), independent SSC
(ISSC) and extended SSC (XSSC). The characteristics of these algorithms are
highlighted and the potential future development of SSC is also discussed.
Comment: This paper has been published in Information Sciences Journal in 201
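To make the "soft" in soft subspace clustering concrete, the following is a minimal sketch and not an algorithm taken from the survey. It assumes an entropy-style weight update, in which each feature of a cluster receives a weight proportional to exp(-dispersion / gamma); the function names and the `gamma` parameter are hypothetical. The point it illustrates is that every feature keeps a strictly positive weight, whereas a hard subspace method would keep or discard each feature outright.

```python
import math

def soft_feature_weights(points, gamma=1.0):
    """Per-feature weights for ONE cluster (assumed entropy-style update).

    Features along which the cluster is compact get large weights;
    features along which it is spread out get small, but nonzero, weights.
    """
    dims = len(points[0])
    center = [sum(p[j] for p in points) / len(points) for j in range(dims)]
    # Within-cluster dispersion of each feature around the cluster center.
    dispersion = [sum((p[j] - center[j]) ** 2 for p in points) for j in range(dims)]
    scores = [math.exp(-d / gamma) for d in dispersion]
    total = sum(scores)
    return [s / total for s in scores]

def weighted_sq_distance(x, center, weights):
    """Squared distance in the cluster's own softly weighted subspace."""
    return sum(w * (a - c) ** 2 for w, a, c in zip(weights, x, center))

# A cluster that is tight in feature 0 and widely spread in feature 1.
cluster = [[1.0, 0.0], [1.1, 5.0], [0.9, -5.0]]
weights = soft_feature_weights(cluster)
```

With these weights, a point is judged close to the cluster mainly by feature 0, which is the sense in which the cluster is associated with a subspace.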
A General Spatio-Temporal Clustering-Based Non-local Formulation for Multiscale Modeling of Compartmentalized Reservoirs
Representing the reservoir as a network of discrete compartments with
neighbor and non-neighbor connections is a fast, yet accurate method for
analyzing oil and gas reservoirs. Automatic and rapid detection of coarse-scale
compartments with distinct static and dynamic properties is an integral part of
such high-level reservoir analysis. In this work, we present a novel hybrid
framework specific to reservoir analysis: data-driven clustering techniques
automatically detect clusters in space from spatial and temporal field data,
and are coupled with a physics-based non-local modeling framework to provide
fast and accurate multiscale modeling of compartmentalized reservoirs. This
research also adds to the literature by
presenting a comprehensive treatment of spatio-temporal clustering for
reservoir studies that accounts for the clustering complexities, the
intrinsically sparse and noisy nature of the data, and the interpretability of
the outcome.
Keywords: Artificial Intelligence; Machine Learning; Spatio-Temporal
Clustering; Physics-Based Data-Driven Formulation; Multiscale Modelin
Multiple Imputation Ensembles (MIE) for dealing with missing data
Missing data is a significant issue in many real-world datasets, yet there are no robust methods for dealing with it appropriately. In this paper, we propose a robust approach to dealing with missing data in classification problems: Multiple Imputation Ensembles (MIE). Our method integrates two approaches, multiple imputation and ensemble methods, and compares two types of ensembles: bagging and stacking. We also propose a robust experimental set-up using 20 benchmark datasets from the UCI machine learning repository. For each dataset, we introduce increasing amounts of data Missing Completely at Random. First, we use a number of single and multiple imputation methods to recover the missing values, and then ensemble a number of different classifiers built on the imputed data. We assess the quality of the imputation using dissimilarity measures. We also evaluate MIE performance by comparing classification accuracy on the complete and imputed data. Furthermore, we use the accuracy of simple imputation as a benchmark for comparison. We find that our proposed approach, combining multiple imputation with ensemble techniques, outperforms the others, particularly as the amount of missing data increases.
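The pipeline described in this abstract (inject MCAR gaps, impute several times, train one classifier per imputed copy, combine the predictions) can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the imputer is a simple random draw from the observed values of the same column, the base classifier is a nearest-centroid rule, and the ensemble is a plain majority vote rather than the bagging or stacking compared in the paper. All function names are hypothetical.

```python
import random
from collections import Counter

def make_mcar(rows, rate, rng):
    """Blank each value independently with probability `rate` (MCAR)."""
    return [[None if rng.random() < rate else v for v in row] for row in rows]

def impute_once(rows, rng):
    """One stochastic imputation: fill each gap with a randomly chosen
    observed value from the same column (assumed imputer; it requires
    every column to have at least one observed value)."""
    n_cols = len(rows[0])
    observed = [[r[j] for r in rows if r[j] is not None] for j in range(n_cols)]
    return [[rng.choice(observed[j]) if v is None else v
             for j, v in enumerate(r)] for r in rows]

def fit_centroids(rows, labels):
    """Nearest-centroid classifier: store the mean vector of each class."""
    counts = Counter(labels)
    sums = {}
    for r, y in zip(rows, labels):
        acc = sums.setdefault(y, [0.0] * len(r))
        for j, v in enumerate(r):
            acc[j] += v
    return {y: [s / counts[y] for s in acc] for y, acc in sums.items()}

def predict(model, row):
    """Return the class whose centroid is closest to `row`."""
    return min(model, key=lambda y: sum((a - b) ** 2 for a, b in zip(model[y], row)))

def mie_predict(train_rows, labels, query, m=5, seed=0):
    """Impute m times, fit one model per imputed copy, majority-vote."""
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(m):
        completed = impute_once(train_rows, rng)
        votes[predict(fit_centroids(completed, labels), query)] += 1
    return votes.most_common(1)[0][0]

# Two well-separated classes with some values already missing.
X_missing = [[0.0, None], [None, 0.0], [0.1, 0.2],
             [5.0, 5.1], [None, 4.9], [4.9, None]]
y = ["a", "a", "a", "b", "b", "b"]
label_near_a = mie_predict(X_missing, y, [0.1, 0.1])
label_near_b = mie_predict(X_missing, y, [5.0, 5.0])
```

Because each imputed copy fills the gaps differently, the m models disagree exactly where the missing values matter, and the vote averages out that uncertainty. `make_mcar` can be used to inject increasing missingness rates into complete data for the kind of experiment the abstract describes.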
On pruning and feature engineering in Random Forests.
Random Forest (RF) is an ensemble classification technique that was developed by Leo Breiman over a decade ago. Compared with other ensemble techniques, it has proved its accuracy and superiority. Many researchers, however, believe that there is still room for optimizing RF further by improving its accuracy. This explains why there have been many extensions of RF, each employing a variety of techniques and strategies to improve certain aspect(s) of RF. The main focus of this dissertation is to develop new extensions of RF using optimization techniques that, to the best of our knowledge, have never been used before to optimize RF. These techniques are clustering, the local outlier factor, diversified weighted subspaces, and replicator dynamics. Applying these techniques to RF produced four extensions, which we have termed CLUB-DRF, LOFB-DRF, DSB-RF, and RDB-DR, respectively. Experimental studies on 15 real datasets showed favorable results, demonstrating the potential of the proposed methods. Performance-wise, CLUB-DRF is ranked first in terms of accuracy and classification speed, making it ideal for real-time applications and for machines/devices with limited memory and processing power.