Adjusted Measures for Feature Selection Stability for Data Sets with Similar Features
For data sets with similar features, for example highly correlated features,
most existing stability measures behave in an undesired way: They consider
features that are almost identical but have different identifiers as different
features. Existing adjusted stability measures, that is, stability measures
that take into account the similarities between features, have major
theoretical drawbacks. We introduce new adjusted stability measures that
overcome these drawbacks. We compare them to each other and to existing
stability measures based on both artificial and real sets of selected features.
Based on the results, we suggest using one new stability measure that considers
highly similar features as exchangeable.
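To make the problem concrete, here is a minimal sketch. It does not implement the measures proposed in the paper. It compares an ordinary stability score (the mean pairwise Jaccard index of the selected feature sets) with a hypothetical adjusted variant that first maps every feature to a representative of its similarity group, so that near-identical features count as the same feature. The function names, the `groups` mapping, and the toy selections are all assumptions made for illustration.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index of two feature sets; defined as 1.0 if both are empty."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def mean_pairwise_jaccard(selections):
    """Unadjusted stability: average Jaccard index over all pairs of selections."""
    pairs = list(combinations(selections, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


def adjusted_stability(selections, groups):
    """Hypothetical adjustment: replace each feature by its group representative
    (features missing from `groups` represent themselves), then score as before."""
    mapped = [{groups.get(f, f) for f in sel} for sel in selections]
    return mean_pairwise_jaccard(mapped)


# Three runs of a feature selector; x1 and x2 are assumed to be almost identical.
selections = [{"x1", "x3"}, {"x2", "x3"}, {"x1", "x4"}]
groups = {"x1": "g1", "x2": "g1"}

unadjusted = mean_pairwise_jaccard(selections)
adjusted = adjusted_stability(selections, groups)
print(f"unadjusted: {unadjusted:.3f}, adjusted: {adjusted:.3f}")
```

In this toy case the unadjusted score penalizes the swap between x1 and x2, while the adjusted variant treats it as no change at all, which is the behavior the abstract argues for when features are highly similar.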
Predicting disease progression in behavioral variant frontotemporal dementia
Introduction: The behavioral variant of frontotemporal dementia (bvFTD) is a rare neurodegenerative disease. Reliable predictors of disease progression have not been sufficiently identified. We investigated multivariate magnetic resonance imaging (MRI) biomarker profiles for their value in predicting individual decline. Methods: One hundred five bvFTD patients were recruited from the German frontotemporal lobar degeneration (FTLD) consortium study. After defining two groups ("fast progressors" vs. "slow progressors"), we investigated the predictive value of MR brain volumes for disease progression rates by performing exhaustive screenings with multivariate classification models. Results: We identified areas that predict disease progression rate within 1 year. Prediction measures revealed an overall accuracy of 80% across our 50 top classification models. In particular, the pallidum, middle temporal gyrus, inferior frontal gyrus, cingulate gyrus, middle orbitofrontal gyrus, and insula occurred in these models. Discussion: Based on the revealed marker combinations, an individual prognosis seems to be feasible. This might be used in clinical studies on an individualized progression model.
Ensemble of a subset of kNN classifiers
Combining multiple classifiers, known as ensemble methods, can substantially improve the prediction performance of learning algorithms, especially in the presence of non-informative features in the data sets. We propose an ensemble of a subset of kNN classifiers, ESkNN, for classification tasks, built in two steps. First, we choose classifiers based on their individual performance, measured by out-of-sample accuracy. The selected classifiers are then combined sequentially, starting from the best model, and assessed for collective performance on a validation data set. We evaluate our method on benchmark data sets with their original features and with some added non-informative features. The results are compared with the usual kNN, bagged kNN, random kNN, the multiple feature subset method, random forest, and support vector machines. Our experimental comparisons on benchmark classification problems and simulated data sets reveal that the proposed ensemble gives better classification performance than the usual kNN and its ensembles, and performs comparably to random forest and support vector machines.
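The two steps described above can be sketched roughly as follows. This is an illustrative reading of the abstract, not the authors' implementation. Several details are assumptions the abstract does not specify: each base classifier is a kNN model on a random feature subset, a model is kept in step two only if it raises validation accuracy, and the data are synthetic. All function names, parameters, and default values are hypothetical.

```python
import random
from collections import Counter


def knn_predict(train_X, train_y, x, feats, k=3):
    """Majority vote of the k nearest training points, using only the
    feature indices in `feats` (squared Euclidean distance)."""
    dists = sorted(
        (sum((row[j] - x[j]) ** 2 for j in feats), label)
        for row, label in zip(train_X, train_y)
    )
    return Counter(label for _, label in dists[:k]).most_common(1)[0][0]


def vote(models, train_X, train_y, X, k=3):
    """Predict each row of X by majority vote over feature-subset kNN models."""
    preds = []
    for x in X:
        votes = Counter(knn_predict(train_X, train_y, x, f, k) for f in models)
        preds.append(votes.most_common(1)[0][0])
    return preds


def accuracy(preds, y):
    return sum(p == t for p, t in zip(preds, y)) / len(y)


def esknn_sketch(train, test, valid, n_models=20, subset_size=2, keep=5, seed=0):
    (trX, trY), (teX, teY), (vaX, vaY) = train, test, valid
    rng = random.Random(seed)
    n_feats = len(trX[0])
    # Step 1: score each candidate model by out-of-sample accuracy, keep the best.
    candidates = [
        tuple(sorted(rng.sample(range(n_feats), subset_size)))
        for _ in range(n_models)
    ]
    ranked = sorted(
        candidates,
        key=lambda f: accuracy(vote([f], trX, trY, teX), teY),
        reverse=True,
    )[:keep]
    # Step 2: add models best-first; keep one only if the ensemble's
    # validation accuracy improves (assumed acceptance rule).
    ensemble = [ranked[0]]
    best = accuracy(vote(ensemble, trX, trY, vaX), vaY)
    for f in ranked[1:]:
        acc = accuracy(vote(ensemble + [f], trX, trY, vaX), vaY)
        if acc > best:
            ensemble.append(f)
            best = acc
    return ensemble, best


def make_data(n, rng):
    """Synthetic data: feature 0 is informative, features 1 to 4 are noise."""
    X, y = [], []
    for _ in range(n):
        label = rng.randint(0, 1)
        X.append([label + rng.gauss(0, 0.3)] + [rng.gauss(0, 1) for _ in range(4)])
        y.append(label)
    return X, y


data_rng = random.Random(1)
train, test, valid = make_data(60, data_rng), make_data(30, data_rng), make_data(30, data_rng)
ensemble, valid_acc = esknn_sketch(train, test, valid)
print("selected feature subsets:", ensemble, "validation accuracy:", valid_acc)
```

Keeping the selection data (`test`) separate from the validation data (`valid`) mirrors the abstract's distinction between out-of-sample accuracy for individual models and collective performance on a validation set.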
A feature selection method for classification within functional genomics experiments based on the proportional overlapping score
Background: Microarray technology, as well as other functional genomics experiments, allows simultaneous measurement of thousands of genes within each sample. Both the prediction accuracy and the interpretability of a classifier could be enhanced by performing the classification based only on selected discriminative genes. We propose a statistical method for selecting genes based on overlapping analysis of expression data across classes. This method results in a novel measure, called the proportional overlapping score (POS), of a feature's relevance to a classification task. Results: We apply POS, along with four widely used gene selection methods, to several benchmark gene expression datasets. The experimental results of classification error rates computed using the Random Forest, k Nearest Neighbor and Support Vector Machine classifiers show that POS achieves a better performance. Conclusions: A novel gene selection method, POS, is proposed. POS analyzes the expression overlap across classes, taking into account the proportions of overlapping samples. It robustly defines a mask for each gene that allows it to minimize the effect of expression outliers. The constructed masks, along with a novel gene score, are exploited to produce the selected subset of genes.
Differentiation of multiple types of pancreatico-biliary tumors by molecular analysis of clinical specimens
Timely and accurate diagnosis of pancreatic ductal adenocarcinoma (PDAC)
is critical in order to provide adequate treatment to patients. However,
the clinical signs and symptoms of PDAC are shared by several types of
malignant or benign tumors which may be difficult to differentiate from
PDAC with conventional diagnostic procedures. Among others, these
include ampullary cancers, solid pseudopapillary tumors, and
adenocarcinomas of the distant bile duct, as well as inflammatory masses
developing in chronic pancreatitis. Here, we report an approach to
accurately differentiate between these different types of pancreatic
masses based on molecular analysis of biopsy material. A total of 156
bulk tissue and fine needle aspiration biopsy samples were analyzed
using a dedicated diagnostic cDNA array and a composite classification
algorithm developed based on linear support vector machines. All five
histological subtypes of pancreatic masses were clearly separable with
100% accuracy when using all 156 individual samples for classification.
Generalized performance of the classification system was tested by
10x10-fold cross validation (100 test runs). Correct classification into
the five diagnostic groups was demonstrated for 81.5% of 1,560 test set
predictions. Performance increased to 85.3% accuracy when PDAC and
distant bile duct carcinomas were combined in a single diagnostic class.
Importantly, overall sensitivity of detection of malignant disease was
92.2%. The molecular diagnostic approach presented here is suitable to
significantly aid in the differential diagnosis of undetermined
pancreatic masses. To our knowledge, this is the first study reporting
accurate differentiation between several types of pancreatico-biliary
tumors in a single molecular analytical procedure.
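The counts quoted for the cross-validation follow from the design: 10 repetitions of 10-fold cross-validation give 100 test runs, and because every one of the 156 samples lands in exactly one test fold per repetition, there are 10 × 156 = 1,560 test set predictions. The sketch below only reproduces this bookkeeping. The fold assignment (a random shuffle, unstratified) and the function name are assumptions, and no classifier is trained.

```python
import random


def repeated_kfold(n_samples, n_folds=10, n_repeats=10, seed=0):
    """Yield (train_idx, test_idx) for every fold of every repetition.
    Folds are unstratified here; the study's exact scheme is not stated."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        idx = list(range(n_samples))
        rng.shuffle(idx)
        for f in range(n_folds):
            test_idx = idx[f::n_folds]  # every n_folds-th shuffled sample
            test_set = set(test_idx)
            train_idx = [i for i in idx if i not in test_set]
            yield train_idx, test_idx


splits = list(repeated_kfold(156))
n_runs = len(splits)
n_test_predictions = sum(len(test_idx) for _, test_idx in splits)
print(n_runs, n_test_predictions)
```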