
    Ensemble of Feature Selection Techniques for High Dimensional Data

    Data mining involves the use of data analysis tools to discover previously unknown, valid patterns and relationships from large amounts of data stored in databases, data warehouses, or other information repositories. Feature selection is an important preprocessing step of data mining that helps increase the predictive performance of a model. The main aim of feature selection is to choose a subset of features with high predictive information and eliminate irrelevant features with little or no predictive information. Using a single feature selection technique may lead to local optima. In this thesis we propose an ensemble approach for feature selection, where multiple feature selection techniques are combined to yield more robust and stable results. The ensemble of multiple feature ranking techniques is performed in two steps. The first step creates a set of different feature selectors, each providing its own sorted order of features, while the second step aggregates the results of all feature ranking techniques. The ensemble method used in our study is frequency count, with the mean rank used to resolve any frequency-count collisions. Experiments in this work are performed on datasets collected from the Kent Ridge bio-medical data repository; the Lung Cancer and Lymphoma datasets are selected for the experiments. The Lung Cancer dataset consists of 57 attributes and 32 instances, and the Lymphoma dataset consists of 4027 attributes and 96 instances. Experiments are performed on the reduced datasets obtained from feature ranking, and these datasets are used to build the classification models. Model performance is evaluated in terms of the AUC (Area Under the Receiver Operating Characteristic Curve) metric, and ANOVA tests are performed on the AUC results. Experimental results suggest that an ensemble of multiple feature selection techniques is more effective than an individual feature selection technique.
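
    As a hedged sketch of the two-step procedure described above, the snippet below builds several feature rankings and aggregates them by frequency count over each ranker's top-k list, falling back to mean rank on ties. The ranking functions, the top-k cutoff, and helper names such as `aggregate_rankings` are illustrative assumptions, not the thesis implementation.

```python
# Illustrative two-step ensemble feature ranking: several rankers each produce
# a sorted feature order, then frequency counts over their top-k lists are
# aggregated, with mean rank breaking frequency ties (names are hypothetical).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import chi2, f_classif, mutual_info_classif


def rank_features(score_fn, X, y):
    """Return feature indices sorted from most to least relevant."""
    scores = score_fn(X, y)
    scores = scores[0] if isinstance(scores, tuple) else scores  # chi2/f_classif return (scores, pvalues)
    return np.argsort(scores)[::-1]


def aggregate_rankings(rankings, n_features, top_k):
    """Frequency count over top-k lists; mean rank resolves collisions."""
    freq = np.zeros(n_features)
    mean_rank = np.zeros(n_features)
    for order in rankings:
        positions = np.empty(n_features)
        positions[order] = np.arange(n_features)  # rank of each feature in this ordering
        freq[order[:top_k]] += 1
        mean_rank += positions
    mean_rank /= len(rankings)
    # Sort by descending frequency, then by ascending mean rank on ties.
    return sorted(range(n_features), key=lambda f: (-freq[f], mean_rank[f]))


X, y = make_classification(n_samples=96, n_features=50, random_state=0)
X = X - X.min(axis=0)  # chi2 requires non-negative inputs
rankings = [rank_features(fn, X, y) for fn in (chi2, f_classif, mutual_info_classif)]
selected = aggregate_rankings(rankings, X.shape[1], top_k=10)[:10]
print("Top 10 features by ensemble ranking:", selected)
```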

    Class-Level Refactoring Prediction by Ensemble Learning with Various Feature Selection Techniques

    Background: Refactoring is changing a software system without affecting its functionality. Current research aims to identify the appropriate method(s) or class(es) that need to be refactored in object-oriented software. Ensemble learning helps to reduce prediction errors by amalgamating different classifiers and their respective performances over the original feature data. This paper additionally considers several ensemble learners, error measures, sampling techniques, and feature selection techniques for refactoring prediction at the class level. Objective: This work aims to develop an ensemble-based refactoring prediction model with structural identification of source code metrics, using different feature selection techniques and data sampling techniques to distribute the data uniformly. Our model finds the best classifier after achieving fewer errors during refactoring prediction at the class level. Methodology: First, our proposed model extracts a total of 125 software metrics computed from object-oriented software systems, which are processed by a robust multi-phased feature selection method encompassing the Wilcoxon significance test, the Pearson correlation test, and principal component analysis (PCA). The proposed multi-phased feature selection method retains the optimal features characterizing inheritance, size, coupling, cohesion, and complexity. After obtaining the optimal set of software metrics, a novel heterogeneous ensemble classifier is developed in which ANN-Gradient Descent, ANN-Levenberg Marquardt, ANN-GDX, and ANN-Radial Basis Function networks; support vector machines with different kernel functions (LSSVM-Linear, LSSVM-Polynomial, LSSVM-RBF); the Decision Tree algorithm; the Logistic Regression algorithm; and an extreme learning machine (ELM) model are used as base classifiers. We calculate four different errors: Mean Absolute Error (MAE), Mean Magnitude of Relative Error (MORE), Root Mean Square Error (RMSE), and Standard Error of the Mean (SEM). Result: In our proposed model, the maximum voting ensemble (MVE) achieves better accuracy, recall, precision, and F-measure values (99.76, 99.93, 98.96, 98.44) than the base trained ensemble (BTE), and it exhibits lower errors (MAE = 0.0057, MORE = 0.0701, RMSE = 0.0068, and SEM = 0.0107) during its implementation to develop the refactoring model. Conclusions: Our experimental results recommend that MVE with upsampling can be implemented to improve the performance of the refactoring prediction model at the class level. Furthermore, the performance of our model with different data sampling and feature selection techniques is shown in the form of boxplot diagrams of accuracy, F-measure, precision, recall, and area under the curve (AUC).
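
    A minimal sketch of a multi-phased metric filter in the spirit of the one described above: a Wilcoxon rank-sum significance filter, Pearson-correlation redundancy removal, and PCA. The thresholds, the synthetic data, and the function name `multi_phase_selection` are assumptions for illustration, not the paper's configuration.

```python
# Hedged sketch of a multi-phased feature selection over source-code metrics:
# (1) keep metrics that differ significantly between refactored and
# non-refactored classes (Wilcoxon rank-sum test), (2) drop one metric of any
# highly correlated pair (Pearson), (3) compress the survivors with PCA.
import numpy as np
from scipy.stats import ranksums
from sklearn.decomposition import PCA


def multi_phase_selection(X, y, alpha=0.05, corr_threshold=0.9, variance=0.95):
    # Phase 1: Wilcoxon rank-sum significance filter on each metric.
    keep = [j for j in range(X.shape[1])
            if ranksums(X[y == 1, j], X[y == 0, j]).pvalue < alpha]
    X = X[:, keep]

    # Phase 2: drop one metric from every highly correlated pair (Pearson).
    corr = np.abs(np.corrcoef(X, rowvar=False))
    drop = set()
    for i in range(corr.shape[0]):
        for j in range(i + 1, corr.shape[1]):
            if corr[i, j] > corr_threshold and j not in drop:
                drop.add(j)
    X = X[:, [j for j in range(X.shape[1]) if j not in drop]]

    # Phase 3: PCA retaining the requested share of variance.
    return PCA(n_components=variance).fit_transform(X)


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 125))    # 125 object-oriented source-code metrics (synthetic)
y = rng.integers(0, 2, size=200)   # 1 = refactored class, 0 = not refactored
X[:, :10] += y[:, None] * 1.5      # make the first ten metrics informative for the demo
print("Reduced shape:", multi_phase_selection(X, y).shape)
```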

    Spectral Band Selection for Ensemble Classification of Hyperspectral Images with Applications to Agriculture and Food Safety

    In this dissertation, an ensemble non-uniform spectral feature selection and a kernel density decision fusion framework are proposed for the classification of hyperspectral data using a support vector machine classifier. Hyperspectral data have a large number of bands, which are typically highly correlated. To exploit their full potential, a feature selection step is necessary. In an ensemble setting, there are two main challenges: (1) creating a diverse set of classifiers in order to achieve higher classification accuracy than a single classifier, which can be done either by using different classifiers or by using different subsets of features for each classifier in the ensemble; and (2) designing a robust decision fusion stage to fully utilize the decisions produced by the individual classifiers. This dissertation tests the efficacy of the proposed approach on hyperspectral data from different applications. Since these datasets have a small number of training samples and a large number of highly correlated features, conventional feature selection approaches such as random feature selection cannot exploit the variability in the correlation level between bands to achieve diverse subsets for classification. In contrast, the approach proposed in this dissertation exploits this variability by dividing the spectrum into groups and selecting bands from each group according to its size. The intelligent decision fusion proposed in this approach uses the probability density of the training classes to produce a final class label. The experimental results demonstrate the validity of the proposed framework, which yields improvements in overall, user, and producer accuracies compared to other state-of-the-art techniques. The experiments also demonstrate the ability of the proposed approach to produce more diverse feature selections than conventional approaches.
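
    The band-grouping idea can be illustrated with a small, hedged sketch: the spectrum is split into groups and each SVM in the ensemble draws bands from every group in proportion to the group's size, with simple probability averaging standing in for the kernel-density fusion stage. The group boundaries, fractions, and synthetic data below are assumptions, not the dissertation's settings.

```python
# Illustrative non-uniform band selection for an SVM ensemble: each member
# samples bands from every spectral group in proportion to the group's size,
# so members stay diverse while still covering the whole spectrum.
import numpy as np
from sklearn.svm import SVC


def sample_band_subset(group_bounds, fraction, rng):
    """Pick round(fraction * group_size) bands from each group."""
    subset = []
    for start, stop in group_bounds:
        n_pick = max(1, round(fraction * (stop - start)))
        subset.extend(rng.choice(np.arange(start, stop), size=n_pick, replace=False))
    return np.sort(subset)


rng = np.random.default_rng(42)
n_bands = 200
groups = [(0, 60), (60, 110), (110, 160), (160, 200)]   # hypothetical correlation groups

X = rng.normal(size=(50, n_bands))                      # 50 labelled pixels (synthetic)
y = rng.integers(0, 3, size=50)                         # 3 land-cover classes

ensemble = []
for _ in range(5):                                      # 5 diverse SVM members
    bands = sample_band_subset(groups, fraction=0.3, rng=rng)
    clf = SVC(kernel="rbf", probability=True).fit(X[:, bands], y)
    ensemble.append((bands, clf))

# Decision fusion by averaging class probabilities (a simple stand-in for the
# kernel-density fusion described in the dissertation).
probs = np.mean([clf.predict_proba(X[:, bands]) for bands, clf in ensemble], axis=0)
print("Fused predictions:", probs.argmax(axis=1)[:10])
```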

    Classifying malignant brain tumours from 1H-MRS data using Breadth Ensemble Learning

    In neuro-oncology, the accurate diagnostic identification and characterization of tumours is paramount for determining their prognosis and the adequate course of treatment. This is usually a difficult problem per se, owing to the localization of the tumour in an extremely sensitive and hard-to-reach organ such as the brain. The clinical analysis of brain tumours therefore often requires non-invasive measurement methods, most commonly imaging techniques. The discrimination between high-grade malignant tumours of different origin but similar characteristics, such as glioblastomas and metastases, is a particularly difficult problem in this context, because imaging techniques are often not sensitive enough and the spectroscopic signals of these tumours are overall too similar. In spite of this, machine learning techniques, coupled with robust feature selection procedures, have recently made substantial inroads into the problem. In this study, magnetic resonance spectroscopy data from an international, multicentre database were used to discriminate between these two types of malignant brain tumours using ensemble learning techniques, with a focus on the definition of a feature selection method specifically designed for ensembles. This method, Breadth Ensemble Learning, takes advantage of the fact that many of the frequencies of the available spectra convey no relevant information for the discrimination of the tumours. The potential of the proposed method is supported by some of the best results reported to date for this problem.
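
    The abstract does not detail the Breadth Ensemble Learning procedure itself, so the sketch below only illustrates the general pattern it alludes to: discarding uninformative spectral frequencies with a univariate relevance filter before training an ensemble. All names, scores, and data here are illustrative stand-ins, not the authors' method.

```python
# Generic illustration of the idea that many spectral frequencies carry no
# discriminative information: score each frequency with a univariate test,
# drop the uninformative ones, and train a bagged ensemble on the survivors.
# This is NOT the Breadth Ensemble Learning algorithm, only a stand-in.
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(7)
X = rng.normal(size=(120, 300))          # 120 spectra, 300 frequency bins (synthetic)
y = rng.integers(0, 2, size=120)         # glioblastoma vs. metastasis labels
X[:, :15] += y[:, None] * 0.8            # only a few frequencies are informative

model = make_pipeline(
    SelectKBest(f_classif, k=20),        # keep 20 relevant frequencies
    BaggingClassifier(LogisticRegression(max_iter=1000), n_estimators=25, random_state=0),
)
scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
print("Mean AUC:", scores.mean().round(3))
```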

    Machine Learning-Based Ensemble Recursive Feature Selection of Circulating miRNAs for Cancer Tumor Classification

    Lopez-Rincon A, Mendoza-Maldonado L, Martinez-Archundia M, et al. Machine Learning-Based Ensemble Recursive Feature Selection of Circulating miRNAs for Cancer Tumor Classification. Cancers. 2020;12(7):1785.
    Circulating microRNAs (miRNAs) are small noncoding RNA molecules that can be detected in bodily fluids without the need for major invasive procedures on patients. miRNAs have shown great promise as biomarkers for tumors, both to assess their presence and to predict their type and subtype. Recently, thanks to the availability of miRNA datasets, machine learning techniques have been successfully applied to tumor classification. The results, however, are difficult for medical experts to assess and interpret because the algorithms exploit information from thousands of miRNAs. In this work, we propose a novel technique that aims at reducing the necessary information to the smallest possible set of circulating miRNAs. The dimensionality reduction achieved reflects a very important first step in a potential, clinically actionable, circulating-miRNA-based precision medicine pipeline. While it is currently under discussion whether this first step can be taken, we demonstrate here that it is possible to perform classification tasks by exploiting a recursive feature elimination procedure that integrates a heterogeneous ensemble of high-quality, state-of-the-art classifiers on circulating miRNAs. Heterogeneous ensembles can compensate for the inherent biases of individual classifiers by using different classification algorithms. Selecting features then further eliminates biases emerging from using data from different studies or batches, yielding more robust and reliable outcomes. The proposed approach is first tested on a tumor classification problem, separating 10 different types of cancer with samples collected over 10 different clinical trials, and is later assessed on a cancer subtype classification task aiming to distinguish triple negative breast cancer from other subtypes of breast cancer. Overall, the presented methodology proves to be effective and compares favorably to other state-of-the-art feature selection methods.
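
    A hedged sketch of recursive feature elimination driven by a heterogeneous ensemble: in each round several different classifiers rank the remaining features, the ranks are averaged, and the weakest features are dropped. The particular classifiers, drop rate, and synthetic miRNA data below are illustrative assumptions rather than the published pipeline.

```python
# Ensemble-driven recursive feature elimination (illustrative sketch):
# average feature ranks from several classifiers, drop the worst features,
# and repeat until the desired number of miRNAs remains.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression


def importance_rank(clf, X, y):
    """Rank features (0 = most important) from a freshly fitted classifier."""
    clf.fit(X, y)
    scores = np.abs(clf.coef_[0]) if hasattr(clf, "coef_") else clf.feature_importances_
    return np.argsort(np.argsort(-scores))


def ensemble_rfe(X, y, n_keep=20, drop_per_round=10):
    classifiers = [
        LogisticRegression(max_iter=2000),
        RandomForestClassifier(n_estimators=100, random_state=0),
        GradientBoostingClassifier(random_state=0),
    ]
    remaining = np.arange(X.shape[1])
    while remaining.size > n_keep:
        ranks = np.mean([importance_rank(c, X[:, remaining], y) for c in classifiers], axis=0)
        n_drop = min(drop_per_round, remaining.size - n_keep)
        remaining = remaining[np.argsort(ranks)[:remaining.size - n_drop]]
    return remaining


rng = np.random.default_rng(1)
X = rng.normal(size=(150, 100))           # 150 samples, 100 circulating miRNAs (synthetic)
y = rng.integers(0, 2, size=150)
X[:, :5] += y[:, None]                    # make a handful of miRNAs informative
print("Selected miRNA indices:", ensemble_rfe(X, y))
```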

    Robust self-healing prediction model for high dimensional data

    Owing to the advantages provided by data mining techniques, such as increased accuracy and the potential to detect unseen patterns, they have been widely adopted for standard classification problems. They have often been used for high-precision disease prediction in the medical field, and several hybrid prediction models capable of achieving high accuracies have been proposed. Although this holds true, most previous models fail to efficiently address the recurring issue of poor data quality, which plagues most high-dimensional data and proves especially troublesome in highly sensitive medical data. This work proposes a robust self-healing (RSH) hybrid prediction model that uses the data in its entirety, removing errors and inconsistencies rather than discarding any data. Initial processing involves data preparation followed by cleansing, or scrubbing, through context-dependent attribute correction, which ensures that there is no significant loss of relevant information before the feature selection and prediction phases. An ensemble of heterogeneous classifiers, subjected to local boosting, is used to build the prediction model, and a genetic-algorithm-based wrapper feature selection technique, wrapped around the respective classifiers, is employed to select the corresponding optimal set of features that warrants higher accuracy. The proposed method is compared with some of the existing high-performing models and the results are analyzed.
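
    As a minimal, hedged illustration of the genetic-algorithm wrapper idea mentioned above, the sketch below encodes each candidate feature subset as a binary mask and scores it by the cross-validated accuracy of a wrapped classifier. The population size, mutation rate, and choice of classifier are assumptions, not the RSH model's settings.

```python
# GA wrapper feature selection (illustrative sketch): chromosomes are binary
# feature masks, fitness is cross-validated accuracy of the wrapped classifier.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=200, n_features=30, n_informative=6, random_state=0)


def fitness(mask):
    if not mask.any():
        return 0.0
    clf = DecisionTreeClassifier(random_state=0)
    return cross_val_score(clf, X[:, mask], y, cv=3).mean()


def ga_wrapper(n_generations=15, pop_size=20, mutation_rate=0.05):
    population = rng.random((pop_size, X.shape[1])) < 0.5        # random initial masks
    for _ in range(n_generations):
        scores = np.array([fitness(ind) for ind in population])
        parents = population[np.argsort(scores)[-pop_size // 2:]]  # keep the fitter half
        children = []
        while len(children) < pop_size - len(parents):
            a, b = parents[rng.integers(len(parents), size=2)]
            cut = rng.integers(1, X.shape[1])                      # one-point crossover
            child = np.concatenate([a[:cut], b[cut:]])
            child ^= rng.random(X.shape[1]) < mutation_rate        # bit-flip mutation
            children.append(child)
        population = np.vstack([parents, children])
    best = max(population, key=fitness)
    return np.flatnonzero(best)


print("Selected feature indices:", ga_wrapper())
```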

    Stable Feature Selection for Biomarker Discovery

    Feature selection techniques have long served as the workhorse in biomarker discovery applications. Surprisingly, the stability of feature selection with respect to sampling variations has long been under-considered; only recently has this issue received increasing attention. In this article, we review existing stable feature selection methods for biomarker discovery using a generic hierarchical framework. We have two objectives: (1) to provide an overview of this new yet fast-growing topic for convenient reference; (2) to categorize existing methods under an expandable framework for future research and development.