
    Feature Selection via Robust Weighted Score for High Dimensional Binary Class-Imbalanced Gene Expression Data

    In this paper, a robust weighted score for unbalanced data (ROWSU) is proposed for selecting the most discriminative features in high-dimensional binary classification of gene expression data with class imbalance. The method addresses one of the most challenging problems in gene expression analysis: highly skewed class distributions that adversely affect the performance of classification algorithms. First, the training dataset is balanced by synthetically generating data points from minority-class observations. Second, a minimum subset of genes is selected using a greedy search approach. Third, a novel weighted robust score, whose weights are computed from support vectors, is introduced to obtain a refined set of genes. The highest-scoring genes under this score are combined with the minimum subset selected by the greedy search to form the final set of genes. The method thus ensures the selection of the most discriminative genes even in the presence of a skewed class distribution, improving classifier performance. The proposed ROWSU method is evaluated on 66 gene expression datasets. Classification accuracy and sensitivity are used as performance metrics to compare ROWSU with several state-of-the-art methods, and boxplots and stability plots are constructed for a better understanding of the results. The results show that the proposed method outperforms existing feature selection procedures in terms of the classification performance of k nearest neighbours (kNN) and random forest (RF) classifiers.
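    The three-stage pipeline described above can be sketched in numpy. This is a minimal illustration, not the paper's method: the interpolation-based minority generator, the MAD-scaled median-gap score (standing in for the support-vector-weighted score), and all names and constants are assumptions of this sketch.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    # Toy imbalanced expression matrix: 40 majority vs 8 minority samples,
    # 50 genes. Gene 0 is made discriminative by shifting the minority class.
    X_maj = rng.normal(0.0, 1.0, size=(40, 50))
    X_min = rng.normal(0.0, 1.0, size=(8, 50))
    X_min[:, 0] += 3.0

    def oversample_minority(X_min, n_new, rng):
        """Generate synthetic minority points by interpolating random pairs
        (a SMOTE-like step; the paper's exact generator may differ)."""
        i = rng.integers(0, len(X_min), size=n_new)
        j = rng.integers(0, len(X_min), size=n_new)
        lam = rng.random((n_new, 1))
        return X_min[i] + lam * (X_min[j] - X_min[i])

    # Balance the minority class up to the majority size (8 + 32 = 40).
    X_min_bal = np.vstack([X_min, oversample_minority(X_min, 32, rng)])

    def robust_score(X_a, X_b):
        """Illustrative robust per-gene score: median difference scaled by
        pooled MAD (a stand-in for the support-vector weighting)."""
        med_gap = np.abs(np.median(X_a, axis=0) - np.median(X_b, axis=0))
        mad = (np.median(np.abs(X_a - np.median(X_a, axis=0)), axis=0)
               + np.median(np.abs(X_b - np.median(X_b, axis=0)), axis=0))
        return med_gap / (mad + 1e-12)

    scores = robust_score(X_maj, X_min_bal)
    top_genes = np.argsort(scores)[::-1][:5]
    print(top_genes[0])  # the shifted gene ranks first
    ```

    In the full method these top-scoring genes would then be merged with the minimum subset found by the greedy search.
    
    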

    An Optimal k Nearest Neighbours Ensemble for Classification Based on Extended Neighbourhood Rule with Features subspace

    To minimize the effect of outliers, kNN ensembles identify a set of observations closest to a new sample point and estimate its unknown class by majority voting over the labels of the training instances in that neighbourhood. Ordinary kNN-based procedures determine the k closest training observations in a neighbourhood region (enclosed by a sphere) using a distance formula. This may fail when test points follow the pattern of nearest observations lying on a certain path not contained in the given sphere of nearest neighbours. Furthermore, these methods combine hundreds of base kNN learners, many of which may have high classification errors, resulting in poor ensembles. To overcome these problems, an optimal extended neighbourhood rule based ensemble is proposed, where the neighbours are determined in k steps. The rule starts from the sample point nearest to the unseen observation; the next data point selected is the one closest to the previously selected point, and this process continues until the required k observations are obtained. Each base model in the ensemble is constructed on a bootstrap sample in conjunction with a random subset of features. After building a sufficiently large number of base models, the optimal models are selected based on their performance on out-of-bag (OOB) data.
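    The step-wise neighbour search can be sketched directly. The function name `extended_neighbours` is hypothetical and Euclidean distance is assumed:

    ```python
    import numpy as np

    def extended_neighbours(X_train, x_new, k):
        """Select k neighbours in k steps: the first is the training point
        closest to x_new; each subsequent one is the unused point closest
        to the previously selected neighbour (extended neighbourhood rule)."""
        remaining = list(range(len(X_train)))
        anchor = x_new
        chain = []
        for _ in range(k):
            d = np.linalg.norm(X_train[remaining] - anchor, axis=1)
            pick = remaining.pop(int(np.argmin(d)))
            chain.append(pick)
            anchor = X_train[pick]
        return chain

    # Points on a line: the chain follows the path 0 -> 1 -> 2 even though
    # an ordinary 3-NN sphere around the query would pick the same points
    # here; on curved patterns the two rules diverge.
    X = np.array([[1.0], [2.0], [3.0], [10.0]])
    print(extended_neighbours(X, np.array([0.0]), 3))  # [0, 1, 2]
    ```

    The class estimate is then the majority vote of `y_train` over the returned chain, exactly as in ordinary kNN.
    
    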

    Ensemble Pruning for Glaucoma Detection in an Unbalanced Data Set

    Background: Random forests are successful classifier ensemble methods consisting of typically 100 to 1000 classification trees. Ensemble pruning techniques reduce the computational cost, and especially the memory demand, of random forests by reducing the number of trees without a relevant loss of performance, or even with increased performance of the sub-ensemble. Applying them to the early detection of glaucoma, a severe eye disease with low prevalence, based on topographical measurements of the eye background, faces specific challenges. Objectives: We examine the performance of ensemble pruning strategies for glaucoma detection in an unbalanced data situation. Methods: The data set consists of 102 topographical features of the eye background of 254 healthy controls and 55 glaucoma patients. We compare the area under the receiver operating characteristic curve (AUC) and the Brier score, on the total data set, in the majority class and in the minority class, of pruned random forest ensembles obtained with strategies based on the prediction accuracy of greedily grown sub-ensembles, on uncertainty-weighted accuracy, and on the similarity between single trees. To validate the findings and to examine the influence of the prevalence of glaucoma in the data set, we additionally perform a simulation study with lower prevalences of glaucoma. Results: In glaucoma classification, all three pruning strategies lead to improved AUC and smaller Brier scores on the total data set with sub-ensembles as small as 30 to 80 trees, compared to the classification results obtained with the full ensemble of 1000 trees. In the simulation study, we show that the prevalence of glaucoma is a critical factor: lower prevalence decreases the performance of our pruning strategies.
Conclusions: The memory demand for glaucoma classification in an unbalanced data situation based on random forests can be effectively reduced by pruning strategies, without loss of performance, in a population with increased risk of glaucoma.
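    Greedy accuracy-based pruning, the first of the three strategies, can be sketched once per-tree validation predictions are available. The simulated `probs` matrix below stands in for a fitted forest, and plain accuracy replaces AUC to keep the sketch numpy-only; the uncertainty-weighted and similarity-based strategies are not shown.

    ```python
    import numpy as np

    rng = np.random.default_rng(1)

    # Stand-in for a trained forest: per-tree class-1 probabilities on a
    # validation set (n_trees x n_samples) plus the true labels. Trees with
    # low `quality` are close to pure noise.
    y_val = rng.integers(0, 2, size=200)
    n_trees = 50
    quality = rng.random(n_trees)              # per-tree signal strength
    noise = rng.random((n_trees, 200))
    probs = quality[:, None] * y_val + (1 - quality[:, None]) * noise

    def greedy_prune(probs, y, max_trees):
        """Greedily grow a sub-ensemble: repeatedly add the tree whose
        inclusion most improves validation accuracy; stop when none helps."""
        chosen, best_acc = [], -1.0
        while len(chosen) < max_trees:
            cand_accs = []
            for t in range(len(probs)):
                if t in chosen:
                    cand_accs.append(-1.0)     # already in the sub-ensemble
                    continue
                p = probs[chosen + [t]].mean(axis=0)
                cand_accs.append(np.mean((p > 0.5) == y))
            t_best = int(np.argmax(cand_accs))
            if cand_accs[t_best] <= best_acc:  # no candidate improves: stop
                break
            chosen.append(t_best)
            best_acc = cand_accs[t_best]
        return chosen, best_acc

    sub, acc = greedy_prune(probs, y_val, max_trees=15)
    full_acc = np.mean((probs.mean(axis=0) > 0.5) == y_val)
    print(len(sub), acc >= full_acc)  # far fewer trees, no accuracy loss
    ```

    The pruned sub-ensemble matches or beats the full 50-tree average here, mirroring the 30-to-80-trees-out-of-1000 result reported above.
    
    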

    Sedentary behaviour and physical activity levels in employees of Khyber Medical University Peshawar

    Introduction: The increase in sedentary behaviour and the decrease in physical activity levels are contributing factors to many non-communicable diseases, including obesity, type-II diabetes and cardiovascular problems. Apart from placing a financial burden on the health care system, these diseases have been reported to cause nearly 1.9 million premature deaths per year. The aim of the study was to measure sedentary behaviour and physical activity levels among employees of Khyber Medical University, Peshawar. Material & Methods: A cross-sectional survey was conducted on employees of Khyber Medical University, Peshawar. The total sample size was 172, and the data were collected through convenience sampling using the International Physical Activity Questionnaire (IPAQ) long form, which measures physical activity levels and sedentary behaviour at work. Results: Out of 172 participants, 154 (89.5%) were male and 18 (10.4%) were female, with a mean age of 34.4 ± 2 years. According to the levels of physical activity, 49 (28.5%) were less active, 63 (36.6%) were moderately active and 60 (34.9%) were highly active. The average time participants spent sitting was 8.93 ± 2.35 hours per day. A total of 73.8%, 23.3% and 2.9% of participants could be categorised as having high, moderate and low sedentary behaviour, respectively. Conclusion: The majority of participants (73.8%) demonstrated high sedentary behaviour and therefore needed modification of their daily routine.

    Optimal trees selection for classification via out-of-bag assessment and sub-bagging

    The effect of training data size on machine learning methods has been well investigated over the past two decades. The predictive performance of tree-based machine learning methods generally improves, at a decreasing rate, as the size of the training data increases. We investigate this in the optimal trees ensemble (OTE), where the method fails to learn from some of the training observations because they are held out for internal validation. Modified tree selection methods are therefore proposed for OTE to compensate for this loss of training observations. In the first method, the corresponding out-of-bag (OOB) observations are used in both the individual and the collective performance assessment of each tree. Trees are ranked by their individual performance on the OOB observations. A certain number of top-ranked trees is selected and, starting from the most accurate tree, subsequent trees are added one by one; the impact of each addition is recorded using the OOB observations left out of the bootstrap sample taken for the tree being added. A tree is kept if it improves the predictive accuracy of the ensemble. In the second approach, trees are grown on random subsets of the training data taken without replacement (known as sub-bagging) instead of bootstrap samples (taken with replacement). The observations left out of each sample are used in both the individual and the collective assessment of the corresponding tree, as in the first method. Analyses on 21 benchmark datasets and simulation studies show improved performance of the modified methods in comparison to OTE and other state-of-the-art methods.
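    The difference between the two sampling schemes that the modified methods rely on can be sketched as follows (the sample sizes and seed are arbitrary choices for this illustration):

    ```python
    import numpy as np

    rng = np.random.default_rng(2)
    n = 1000

    # Bootstrap sample (with replacement): roughly 36.8% of the training
    # points are left out-of-bag and can serve as an internal validation set.
    boot = rng.integers(0, n, size=n)
    oob_boot = np.setdiff1d(np.arange(n), boot)

    # Sub-bagging sample (without replacement, here 63.2% of the data so the
    # hold-out matches the expected bootstrap OOB size): the left-out points
    # are exactly the complement, and no observation is duplicated.
    sub = rng.choice(n, size=int(0.632 * n), replace=False)
    oob_sub = np.setdiff1d(np.arange(n), sub)

    print(len(oob_boot) / n)   # close to exp(-1) ~ 0.368
    print(len(oob_sub) / n)    # exactly 0.368
    ```

    Because sub-bagging never repeats an observation, each tree sees distinct training points and the held-out set is exactly determined, which is what the second modified selection method exploits.
    
    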

    Ensemble of Optimal Trees, Random Forest and Random Projection Ensemble Classification

    The predictive performance of a random forest ensemble is highly associated with the strength of the individual trees and their diversity. An ensemble of a small number of accurate and diverse trees, if prediction accuracy is not compromised, will also reduce the computational burden. We investigate the idea of integrating trees that are both accurate and diverse. For this purpose, we use out-of-bag observations as a validation sample from the training bootstrap samples to choose the best trees based on their individual performance, and then assess these trees for diversity using the Brier score on an independent validation sample. Starting from the first best tree, a tree is selected for the final ensemble if its addition to the forest reduces the error of the trees that have already been added. Unlike random projection ensemble classification, our approach does not use an implicit dimension reduction for each tree. A total of 35 benchmark classification and regression problems are used to assess the performance of the proposed method and to compare it with random forest, random projection ensemble, node harvest, support vector machine, kNN and classification and regression tree (CART). We compute unexplained variances or classification error rates for all the methods on the corresponding data sets. Our experiments reveal that the size of the ensemble is reduced significantly and better results are obtained in most cases. Results of a simulation study are also given, in which four tree-style scenarios are considered to generate data sets with several structures.
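    The Brier score used for the diversity assessment, and the effect the selection criterion exploits, can be illustrated with two hand-made trees (all numbers below are made up for illustration):

    ```python
    import numpy as np

    def brier(p, y):
        """Brier score: mean squared difference between the predicted
        class-1 probability and the 0/1 outcome (lower is better)."""
        return float(np.mean((p - y) ** 2))

    y = np.array([1, 0, 1, 1, 0])

    # Two individually mediocre but differently-wrong trees: averaging
    # their probabilities gives a lower Brier score than either tree alone,
    # which is the diversity effect the sequential selection rewards.
    tree_a = np.array([0.9, 0.4, 0.2, 0.8, 0.1])   # wrong on sample 2
    tree_b = np.array([0.3, 0.2, 0.9, 0.7, 0.4])   # wrong on sample 0
    ens = (tree_a + tree_b) / 2

    print(brier(tree_a, y), brier(tree_b, y), brier(ens, y))
    ```

    A candidate tree is kept exactly when, as here, adding it lowers the ensemble's Brier score on the independent validation sample.
    
    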

    Ensemble of a subset of kNN classifiers

    Combining multiple classifiers, known as ensemble learning, can give a substantial improvement in the prediction performance of learning algorithms, especially in the presence of non-informative features in the data. We propose an ensemble of a subset of kNN classifiers, ESkNN, for classification tasks, built in two steps. First, we choose classifiers based on their individual performance, measured by out-of-sample accuracy. The selected classifiers are then combined sequentially, starting from the best model, and assessed for collective performance on a validation data set. We use benchmark data sets, with their original features and with added non-informative features, to evaluate the method. The results are compared with ordinary kNN, bagged kNN, random kNN, the multiple feature subset method, random forest and support vector machines. Our experimental comparisons on benchmark classification problems and simulated data sets reveal that the proposed ensemble gives better classification performance than ordinary kNN and its ensembles, and performs comparably to random forest and support vector machines.
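    The two-step construction can be sketched with a plain numpy kNN over random feature subsets. For brevity this sketch selects and evaluates on the same validation split, whereas ESkNN uses a separate out-of-sample assessment for selection and a validation set for the collective assessment; the data, subset sizes and cut-offs are all illustrative.

    ```python
    import numpy as np

    rng = np.random.default_rng(4)

    # Toy data: 2 informative features out of 10, the rest pure noise.
    n = 300
    y = rng.integers(0, 2, size=n)
    X = rng.normal(0, 1, size=(n, 10))
    X[:, 0] += 2.0 * y
    X[:, 1] -= 2.0 * y
    X_tr, y_tr = X[:200], y[:200]
    X_va, y_va = X[200:], y[200:]

    def knn_predict(X_tr, y_tr, X_te, k=5):
        """Plain kNN: Euclidean distance, majority vote over k neighbours."""
        d = np.linalg.norm(X_te[:, None, :] - X_tr[None, :, :], axis=2)
        nn = np.argsort(d, axis=1)[:, :k]
        return (y_tr[nn].mean(axis=1) > 0.5).astype(int)

    # Step 1: base kNN models on random feature subsets, ranked by
    # out-of-sample accuracy; keep the best ten.
    subsets = [rng.choice(10, size=3, replace=False) for _ in range(30)]
    accs = [np.mean(knn_predict(X_tr[:, s], y_tr, X_va[:, s]) == y_va)
            for s in subsets]
    best = np.argsort(accs)[::-1][:10]

    # Step 2: combine the selected models by majority vote.
    votes = np.stack([knn_predict(X_tr[:, subsets[i]], y_tr,
                                  X_va[:, subsets[i]]) for i in best])
    ens_acc = np.mean((votes.mean(axis=0) > 0.5) == y_va)
    print(round(float(ens_acc), 3))
    ```

    Subsets that happen to contain the two informative features dominate the selection, so the pruned ensemble ignores most of the noise-only base models.
    
    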

    A feature selection method for classification within functional genomics experiments based on the proportional overlapping score

    Background: Microarray technology, like other functional genomics experiments, allows simultaneous measurement of thousands of genes within each sample. Both the prediction accuracy and the interpretability of a classifier can be enhanced by performing the classification based only on selected discriminative genes. We propose a statistical method for selecting genes based on an analysis of how expression values overlap across classes. The method yields a novel measure, called the proportional overlapping score (POS), of a feature's relevance to a classification task. Results: We apply POS, along with four widely used gene selection methods, to several benchmark gene expression datasets. Classification error rates computed with the random forest, k nearest neighbour and support vector machine classifiers show that POS achieves better performance. Conclusions: A novel gene selection method, POS, is proposed. POS analyses the overlap of expression values across classes, taking into account the proportions of overlapping samples. It robustly defines a mask for each gene that minimises the effect of expression outliers. The constructed masks, along with a novel gene score, are used to produce the selected subset of genes.
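    A simplified, range-based version of the overlap idea can be sketched as follows. The published POS additionally uses robust per-gene masks and outlier handling, so this score, its name and the toy data are assumptions of the sketch:

    ```python
    import numpy as np

    def overlap_proportion(x_a, x_b):
        """Simplified overlap measure in the spirit of POS: the proportion
        of all samples whose expression value falls inside the interval
        where the two classes' ranges overlap (lower = more discriminative)."""
        lo = max(x_a.min(), x_b.min())
        hi = min(x_a.max(), x_b.max())
        if lo > hi:                      # ranges do not overlap at all
            return 0.0
        x = np.concatenate([x_a, x_b])
        return float(np.mean((x >= lo) & (x <= hi)))

    rng = np.random.default_rng(3)
    g_good = (rng.normal(0, 1, 30), rng.normal(6, 1, 30))   # well separated
    g_bad = (rng.normal(0, 1, 30), rng.normal(0.2, 1, 30))  # heavy overlap
    print(overlap_proportion(*g_good), overlap_proportion(*g_bad))
    ```

    Genes are then ranked in ascending order of the score, so a well-separated gene like `g_good` is selected ahead of `g_bad`.
    
    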

    A New Modified Exponent Power Alpha Family of Distributions with Applications in Reliability Engineering

    Probability distributions play a very significant role in the applied sciences, particularly in reliability engineering. Engineering data sets may be negatively skewed, positively skewed or symmetrical, so a flexible distribution is required to handle them. In this paper, we propose a new family of lifetime distributions to model such data sets, known as the “New Modified Exponent Power Alpha Family of distributions”, or NMEPA for short. The proposed family is obtained by applying the well-known T-X approach together with the exponential distribution. A three-parameter sub-model of the proposed family, termed the “New Modified Exponent Power Alpha Weibull distribution” (NMEPA-Wei for short), is discussed in detail. Its mathematical properties, including the hazard rate function, ordinary moments, moment generating function and order statistics, are also derived. We adopt the method of maximum likelihood estimation (MLE) for estimating the unknown model parameters, and a brief Monte Carlo simulation study evaluates the performance of the MLE in terms of bias and mean squared error. A comprehensive study is also provided to assess the proposed family by analysing two real-life data sets from reliability engineering. The analytical goodness-of-fit measures of the proposed distribution are compared with those of well-known distributions, including the (i) APT-Wei (alpha power transformed Weibull), (ii) Ex-Wei (exponentiated Weibull), (iii) classical two-parameter Weibull, (iv) Mod-Wei (modified Weibull), and (v) Kumar-Wei (Kumaraswamy–Weibull) distributions. The proposed class of distributions is expected to yield many new distributions for fitting monotonic and non-monotonic data in reliability and survival analysis.
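    The general T-X construction the family builds on can be illustrated with a Weibull baseline and an exponential transformer. This is the textbook T-X special case, not the NMEPA generator itself, which the abstract does not specify:

    ```python
    import numpy as np

    # T-X approach: given a baseline CDF F(x) and a "transformer" random
    # variable T on [0, inf) with CDF R(t), define a new CDF
    #   G(x) = R(W(F(x)))  with  W(u) = -log(1 - u).
    # With T ~ Exponential(lam), R(t) = 1 - exp(-lam*t), this collapses to
    #   G(x) = 1 - (1 - F(x))**lam.

    def weibull_cdf(x, shape, scale):
        """Classical two-parameter Weibull CDF, the baseline F(x)."""
        return 1.0 - np.exp(-((x / scale) ** shape))

    def tx_exponential_cdf(x, shape, scale, lam):
        """T-X transform of the Weibull CDF with an exponential transformer."""
        return 1.0 - (1.0 - weibull_cdf(x, shape, scale)) ** lam

    x = np.linspace(0.0, 10.0, 2001)
    G = tx_exponential_cdf(x, shape=1.5, scale=2.0, lam=0.7)

    # Sanity checks that G is a valid CDF: starts at 0, tends to 1, monotone.
    print(G[0], bool(np.all(np.diff(G) >= 0)))
    ```

    Swapping in a different transformer CDF `R` (as NMEPA does with its modified exponent power alpha generator) produces a different family while preserving the same validity guarantees.
    
    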