95 research outputs found
Ordinal Classifiers Can Fail on Repetitive Class Structures
Ordinal classifiers are constrained classification algorithms that assume a predefined (total) order of the class labels to be reflected in the feature space of a dataset. This information is used to guide the training of ordinal classifiers and might lead to an improved classification performance. Incorrect assumptions on the order of a dataset can result in diminished detection rates. Ordinal classifiers can, therefore, be used to screen for ordinal class structures within a feature representation. While it was shown that algorithms could in principle reject incorrect class orderings, it is unclear if all remaining candidate orders reflect real ordinal structures in feature space. In this work we characterize the decision regions induced by ordinal classifiers. We show that they can fulfill different criteria that might be considered as ordinal reflections. These criteria are mainly determined by the connectedness and the neighborhood of the decision regions. We evaluate them for ordinal classifier cascades constructed from binary classifiers. We show that, depending on the type of base classifier, they bear the risk of not rejecting non-ordinal structures, such as partially repetitive ones.
Multi-Objective Parameter Selection for Classifiers
Setting the free parameters of classifiers to different values can have a profound impact on their performance. For some methods, specialized tuning algorithms have been developed. These approaches mostly tune parameters according to a single criterion, such as the cross-validation error. However, it is sometimes desirable to obtain parameter values that optimize several concurrent - often conflicting - criteria. The TunePareto package provides a general and highly customizable framework to select optimal parameters for classifiers according to multiple objectives. Several strategies for sampling and optimizing parameters are supplied. The algorithm determines a set of Pareto-optimal parameter configurations and leaves the ultimate decision on the weighting of objectives to the researcher. Decision support is provided by novel visualization techniques
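As a rough illustration of what a set of Pareto-optimal parameter configurations means, the following Python sketch filters candidate configurations by dominance. It is not the TunePareto API (TunePareto is an R package); the function names, the choice of objectives, and all numbers are hypothetical, and both objectives are assumed to be minimized.

```python
from typing import Dict, Hashable, Tuple

Objectives = Tuple[float, ...]


def dominates(a: Objectives, b: Objectives) -> bool:
    """True if a is no worse than b in every objective and strictly
    better in at least one (all objectives are minimized)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(results: Dict[Hashable, Objectives]) -> Dict[Hashable, Objectives]:
    """Keep only the configurations that no other configuration dominates."""
    front = {}
    for cfg, obj in results.items():
        dominated = any(
            dominates(other, obj)
            for other_cfg, other in results.items()
            if other_cfg != cfg
        )
        if not dominated:
            front[cfg] = obj
    return front


# Hypothetical values: k of a kNN classifier -> (CV error, 1 - sensitivity).
results = {
    1: (0.20, 0.10),
    3: (0.15, 0.12),
    5: (0.15, 0.20),  # dominated by k=3
    7: (0.25, 0.30),  # dominated by k=1 and k=3
}
front = pareto_front(results)
print(sorted(front))
```

The filter returns every non-dominated configuration and does not rank them, which mirrors the idea of leaving the final weighting of objectives to the researcher.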
Ordinal Prototype-Based Classifiers
The identification of prototypical patterns is one of the major goals in the classification of microarray data. Prototype-based classifiers are of special interest in this context, since they allow a direct biological interpretation. In this work we present prototype-based classifiers that rely on ordinal-scaled data. An advantage of these ordinal-scaled signatures is their invariance to a wide range of data transformations. Standard prototype-based classifiers can be adapted to this type of data by utilizing rank-distances and rank-aggregation procedures. In this study, we compare the proposed methods with standard classifiers. They are examined in experiments with and without feature selection on a panel of publicly available microarray datasets. We show that the proposed techniques result in the construction of different signatures that improve classification performance.
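The invariance property can be sketched in a few lines of Python. This is a minimal illustration and not the authors' method: the rank distance (Spearman footrule), the aggregation rule (ranking the per-feature rank sums), and all data are assumptions chosen for brevity, and tied values are broken by feature index.

```python
import math
from typing import Dict, List


def ranks(values: List[float]) -> List[int]:
    """Rank transform (1 = smallest). Ties are broken by feature index."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def footrule(a: List[int], b: List[int]) -> int:
    """Spearman footrule: sum of absolute rank differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def aggregate(samples: List[List[float]]) -> List[int]:
    """Assumed aggregation rule: rank the per-feature sums of sample ranks."""
    rank_sums = [sum(col) for col in zip(*(ranks(s) for s in samples))]
    return ranks(rank_sums)


def nearest_prototype(sample: List[float], prototypes: Dict[str, List[int]]) -> str:
    """Assign the class whose rank prototype is closest to the sample's ranks."""
    sample_ranks = ranks(sample)
    return min(prototypes, key=lambda label: footrule(sample_ranks, prototypes[label]))


# Hypothetical expression profiles with four features per sample.
prototypes = {
    "A": aggregate([[1.0, 5.0, 3.0, 9.0], [2.0, 6.0, 4.0, 8.0]]),
    "B": aggregate([[9.0, 2.0, 7.0, 1.0], [8.0, 3.0, 6.0, 0.5]]),
}
sample = [0.2, 0.9, 0.5, 1.4]
transformed = [math.exp(10 * v) for v in sample]  # strictly monotone transform

print(nearest_prototype(sample, prototypes))
print(nearest_prototype(transformed, prototypes))
```

Because a strictly monotone transformation leaves the within-sample ranks unchanged, both calls assign the same class.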
Adjusted Measures for Feature Selection Stability for Data Sets with Similar Features
For data sets with similar features, for example highly correlated features,
most existing stability measures behave in an undesired way: They consider
features that are almost identical but have different identifiers as different
features. Existing adjusted stability measures, that is, stability measures
that take into account the similarities between features, have major
theoretical drawbacks. We introduce new adjusted stability measures that
overcome these drawbacks. We compare them to each other and to existing
stability measures based on both artificial and real sets of selected features.
Based on the results, we suggest using one new stability measure that considers
highly similar features as exchangeable
Semantic Multi-Classifier Systems for the Analysis of Gene Expression Profiles
The analysis of biomolecular data from high-throughput screens is typically characterized by the high dimensionality of the measured profiles. Development of diagnostic tools for this kind of data, such as gene expression profiles, is often coupled to an interest of users in obtaining interpretable and low-dimensional classification models, as this facilitates the generation of biological hypotheses on possible causes of a categorization. Purely data-driven classification models are limited in this regard. These models only allow for interpreting the data in terms of marker combinations, often gene expression levels, and rarely bridge the gap to higher-level explanations such as molecular signaling pathways. Here, in addition to the expression profile data, we incorporate into the classification process different data sources that functionally organize these individual gene expression measurements into groups. The members of such a group of measurements share a common property or characterize a more abstract biological concept. These feature subgroups are then used for the generation of individual classifiers. From the set of these classifiers, subsets are combined into a multi-classifier system. Analysing which individual classifiers, and thus which biological concepts such as pathways or ontology terms, are important for classification makes it possible to generate hypotheses about the distinguishing characteristics of the classes on a functional level.
Predicting disease progression in behavioral variant frontotemporal dementia
Introduction: The behavioral variant of frontotemporal dementia (bvFTD) is a rare neurodegenerative disease. Reliable predictors of disease progression have not been sufficiently identified. We investigated multivariate magnetic resonance imaging (MRI) biomarker profiles for their predictive value of individual decline. Methods: One hundred five bvFTD patients were recruited from the German frontotemporal lobar degeneration (FTLD) consortium study. After defining two groups ("fast progressors" vs. "slow progressors"), we investigated the predictive value of MR brain volumes for disease progression rates performing exhaustive screenings with multivariate classification models. Results: We identified areas that predict disease progression rate within 1 year. Prediction measures revealed an overall accuracy of 80% across our 50 top classification models. Especially the pallidum, middle temporal gyrus, inferior frontal gyrus, cingulate gyrus, middle orbitofrontal gyrus, and insula occurred in these models. Discussion: Based on the revealed marker combinations an individual prognosis seems to be feasible. This might be used in clinical studies on an individualized progression model
Ensemble of a subset of kNN classifiers
Combining multiple classifiers, known as ensemble methods, can give substantial improvement in the prediction performance of learning algorithms, especially in the presence of non-informative features in the data sets. We propose an ensemble of a subset of kNN classifiers, ESkNN, for classification tasks in two steps. Firstly, we choose classifiers based upon their individual performance using the out-of-sample accuracy. The selected classifiers are then combined sequentially starting from the best model and assessed for collective performance on a validation data set. We use benchmark data sets with their original and some added non-informative features for the evaluation of our method. The results are compared with usual kNN, bagged kNN, random kNN, multiple feature subset method, random forest and support vector machines. Our experimental comparisons on benchmark classification problems and simulated data sets reveal that the proposed ensemble gives better classification performance than the usual kNN and its ensembles, and performs comparably to random forest and support vector machines.
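The two-step selection can be sketched as follows in Python. This is a simplified illustration under stated assumptions, not the published ESkNN procedure: each model is assumed to output a class-1 probability per validation sample, the ensemble averages these probabilities, and a model is retained only if it strictly improves validation accuracy. All names and numbers are hypothetical.

```python
from typing import Dict, List, Tuple


def accuracy(probs: List[float], truth: List[int]) -> float:
    """Accuracy of thresholded class-1 probabilities (threshold 0.5)."""
    preds = [1 if p > 0.5 else 0 for p in probs]
    return sum(p == t for p, t in zip(preds, truth)) / len(truth)


def mean_probs(prob_lists: List[List[float]]) -> List[float]:
    """Average the class-1 probabilities of several models per sample."""
    return [sum(col) / len(col) for col in zip(*prob_lists)]


def select_ensemble(
    individual_acc: Dict[str, float],
    val_probs: Dict[str, List[float]],
    y_val: List[int],
) -> Tuple[List[str], float]:
    # Step 1: order models by their individual out-of-sample accuracy.
    order = sorted(individual_acc, key=individual_acc.get, reverse=True)
    # Step 2: start from the best model and add the others one at a time,
    # keeping a model only if validation accuracy improves (assumed rule).
    chosen = [order[0]]
    best = accuracy(val_probs[order[0]], y_val)
    for model in order[1:]:
        trial = chosen + [model]
        acc = accuracy(mean_probs([val_probs[m] for m in trial]), y_val)
        if acc > best:
            chosen, best = trial, acc
    return chosen, best


# Hypothetical kNN models built on different feature subsets.
individual_acc = {"a": 0.80, "b": 0.75, "c": 0.40}
val_probs = {
    "a": [0.9, 0.4, 0.2, 0.3],
    "b": [0.6, 0.8, 0.6, 0.1],
    "c": [0.4, 0.3, 0.7, 0.6],
}
y_val = [1, 1, 0, 0]

chosen, best = select_ensemble(individual_acc, val_probs, y_val)
print(chosen, best)
```

Averaging probabilities rather than taking a majority vote avoids ties when only two models have been selected. A weak model such as "c" is left out because it cannot improve the validation accuracy.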
Combined microRNA and mRNA microfluidic TaqMan array cards for the diagnosis of malignancy of multiple types of pancreatico-biliary tumors in fine-needle aspiration material
Pancreatic ductal adenocarcinoma (PDAC) continues to carry the lowest survival rates among all solid tumors. A marked resistance against available therapies, late clinical presentation and insufficient means for early diagnosis contribute to the dismal prognosis. Novel biomarkers are thus required to aid treatment decisions and improve patient outcomes. We describe here a multi-omics molecular platform that, for the first time, allows simultaneous analysis of miRNA and mRNA expression patterns from minimal amounts of biopsy material on a single microfluidic TaqMan Array card. Expression profiles were generated from 113 prospectively collected fine needle aspiration biopsies (FNAB) from patients undergoing surgery for suspect masses in the pancreas. Molecular classifiers were constructed using support vector machines, and rigorously evaluated for diagnostic performance using 10×10-fold cross-validation. The final combined miRNA/mRNA classifier demonstrated a sensitivity of 91.7%, a specificity of 94.5%, and an overall diagnostic accuracy of 93.0% for the differentiation between PDAC and benign pancreatic masses, clearly outperforming miRNA-only classifiers. The classification algorithm also performed very well in the diagnosis of other types of solid tumors (acinar cell carcinomas, ampullary cancer and distal bile duct carcinomas), but was less suited for the diagnostic analysis of cystic lesions. We thus demonstrate that simultaneous analysis of miRNA and mRNA biomarkers from FNAB samples using multi-omics TaqMan Array cards is suitable to differentiate suspect solid pancreatic masses with high precision.
A feature selection method for classification within functional genomics experiments based on the proportional overlapping score
Background: Microarray technology, as well as other functional genomics experiments, allows simultaneous measurements of thousands of genes within each sample. Both the prediction accuracy and interpretability of a classifier could be enhanced by performing the classification based only on selected discriminative genes. We propose a statistical method for selecting genes based on overlapping analysis of expression data across classes. This method results in a novel measure, called proportional overlapping score (POS), of a feature's relevance to a classification task. Results: We apply POS, along with four widely used gene selection methods, to several benchmark gene expression datasets. The experimental results of classification error rates computed using the Random Forest, k Nearest Neighbor and Support Vector Machine classifiers show that POS achieves a better performance. Conclusions: A novel gene selection method, POS, is proposed. POS analyzes the expression overlap across classes, taking into account the proportions of overlapping samples. It robustly defines a mask for each gene that allows it to minimize the effect of expression outliers. The constructed masks, along with a novel gene score, are exploited to produce the selected subset of genes.