Top scoring pairs for feature selection in machine learning and applications to cancer outcome prediction

Author: A Statnikov
AC Tan
C Bishop
C Lai
D Geman
DG Beer
I Guyon
I Inza
J Jin
J Weston
LJ van 't Veer
Mark A Kon
MH Asyali
P Baldi
Ping Shi
Qifu Zhu
R Blanco
R Kohavi
S Hanshall
S Ma
S Yoon
SL Pomeroy
Surajit Ray
TM Cover
TR Golub
TS Furey
V Vinaya
VN Vapnik
X Zhang
Y Saeys
Y Wang
Y Wang
Publication venue: BioMed Central
Publication date: 01/01/2011

Background: The widely used k top scoring pair (k-TSP) algorithm is a simple yet powerful parameter-free classifier. It owes its success in many cancer microarray datasets to an effective feature selection algorithm that is based on relative expression ordering of gene pairs. However, its general robustness does not extend to some difficult datasets, such as those involving cancer outcome prediction, which may be due to the relatively simple voting scheme used by the classifier. We believe that the performance can be enhanced by separating its effective feature selection component and combining it with a powerful classifier such as the support vector machine (SVM). More generally, the top scoring pairs generated by the k-TSP ranking algorithm can be used as a dimensionally reduced subspace for other machine learning classifiers.
Results: We developed an approach integrating the k-TSP ranking algorithm (TSP) with other machine learning methods, allowing combination of the computationally efficient, multivariate feature ranking of k-TSP with multivariate classifiers such as SVM. We evaluated this hybrid scheme (k-TSP+SVM) in a range of simulated datasets with known data structures. Compared with two other feature selection methods, a univariate method similar to Fisher's discriminant criterion (Fisher) and recursive feature elimination embedded in SVM (RFE), TSP becomes increasingly more effective as the informative genes become progressively more correlated, both in classification performance and in the ability to recover the true informative genes. We also applied this hybrid scheme to four cancer prognosis datasets, in which k-TSP+SVM outperforms the k-TSP classifier in all datasets and achieves performance comparable or superior to that of SVM alone. Consistent with what is observed in simulation, TSP appears to be a better feature selector than Fisher and RFE in some of the cancer datasets.
Conclusions: The k-TSP ranking algorithm can be used as a computationally efficient, multivariate filter method for feature selection in machine learning. SVM in combination with the k-TSP ranking algorithm outperforms k-TSP and SVM alone in simulated datasets and in some cancer prognosis datasets. Simulation studies suggest that, as a feature selector, it is better tuned to certain data characteristics, i.e. correlations among informative genes, which makes it potentially interesting as an alternative feature ranking method in pathway analysis.
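To make the idea of ranking gene pairs by relative expression ordering concrete, here is a minimal sketch, not the authors' implementation. It assumes the commonly used TSP score, the absolute difference between the two classes in the frequency with which gene i is expressed below gene j. The function names `tsp_scores` and `top_k_pairs` and the toy data are hypothetical, and the sketch does not enforce disjoint pairs.

```python
from itertools import combinations

def tsp_scores(samples, labels):
    """Score each gene pair (i, j) by |P(x_i < x_j | class 0) - P(x_i < x_j | class 1)|.

    Assumes binary labels 0/1 and at least one sample per class.
    """
    n_genes = len(samples[0])
    groups = {c: [s for s, y in zip(samples, labels) if y == c] for c in (0, 1)}
    scores = {}
    for i, j in combinations(range(n_genes), 2):
        # Fraction of samples in each class where gene i is below gene j.
        p = [sum(s[i] < s[j] for s in groups[c]) / len(groups[c]) for c in (0, 1)]
        scores[(i, j)] = abs(p[0] - p[1])
    return scores

def top_k_pairs(samples, labels, k):
    """Return the k highest-scoring pairs (hypothetical helper; ties keep index order)."""
    scores = tsp_scores(samples, labels)
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Toy data: rows are samples, columns are genes (illustrative values only).
samples = [[1, 5, 3], [2, 6, 1], [7, 2, 4], [8, 1, 5]]
labels = [0, 0, 1, 1]
print(top_k_pairs(samples, labels, 2))
```

In a hybrid scheme of the kind the abstract describes, the genes in the selected pairs would then serve as the reduced feature set passed to a separate classifier such as an SVM.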

CiteSeerX

Crossref

Boston University Institutional Repository (OpenBU)

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Enlighten

Rank discriminants for predicting phenotypes from RNA expression

Author: Afsari Bahman
Braga-Neto Ulisses M.
Geman Donald
Publication venue: 'Institute of Mathematical Statistics'
Publication date: 01/01/2014

Statistical methods for analyzing large-scale biomolecular data are commonplace in computational biology. A notable example is phenotype prediction from gene expression data, for instance, detecting human cancers, differentiating subtypes and predicting clinical outcomes. Still, clinical applications remain scarce. One reason is that the complexity of the decision rules that emerge from standard statistical learning impedes biological understanding, in particular, any mechanistic interpretation. Here we explore decision rules for binary classification utilizing only the ordering of expression among several genes; the basic building blocks are then two-gene expression comparisons. The simplest example, just one comparison, is the TSP classifier, which has appeared in a variety of cancer-related discovery studies. Decision rules based on multiple comparisons can better accommodate class heterogeneity, and thereby increase accuracy, and might provide a link with biological mechanism. We consider a general framework ("rank-in-context") for designing discriminant functions, including a data-driven selection of the number and identity of the genes in the support ("context"). We then specialize to two examples: voting among several pairs and comparing the median expression in two groups of genes. Comprehensive experiments assess accuracy relative to other, more complex, methods, and reinforce earlier observations that simple classifiers are competitive.
Comment: Published at http://dx.doi.org/10.1214/14-AOAS738 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org)
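The two specializations named in the abstract, voting among several pairs and comparing the median expression of two gene groups, can be sketched as rank-based decision rules. This is an illustrative sketch only: the function names, the class orientation (class 1 when the first gene or group is lower), and the toy sample are assumptions, and the paper's data-driven selection of pairs and groups is not shown.

```python
from statistics import median

def vote_pairs(sample, pairs):
    """Predict class 1 if a strict majority of pairs (i, j) has sample[i] < sample[j]."""
    votes = sum(sample[i] < sample[j] for i, j in pairs)
    return int(2 * votes > len(pairs))

def compare_medians(sample, group_a, group_b):
    """Predict class 1 if the median expression of group_a is below that of group_b."""
    med_a = median(sample[g] for g in group_a)
    med_b = median(sample[g] for g in group_b)
    return int(med_a < med_b)

# One expression profile indexed by gene (illustrative values only).
sample = [1.0, 4.0, 2.0, 3.0, 0.5]
print(vote_pairs(sample, [(0, 1), (2, 3), (1, 4)]))
print(compare_medians(sample, [0, 4], [1, 2, 3]))
```

Both rules depend only on the ordering of expression values within a sample, so they are unchanged by any monotone rescaling of that sample.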

arXiv.org e-Print Archive

CiteSeerX

Crossref

Texas A&M Repository

rfTSP: A Non-parametric predictive model with order-based feature selection for transcriptomic data

Author: Cahill Kelly
Publication date: 29/01/2020

Genomic data has strong potential to predict biologic classifications using gene expression data. For example, tumor subtype can be determined using machine learning models and gene expression profiles. We propose the use of Top Scoring Pairs in combination with machine learning to improve inter-study prediction of genomic profiles. Inter-study prediction refers to prediction across two studies that are completely independent either in terms of platform or tissue. Top Scoring Pairs (TSPs) rank pairs of genes according to how consistently their relative expression ordering differs between groups of subjects. For example, gene A will be lowly expressed and gene B highly expressed in cases, while gene A will be highly expressed and gene B lowly expressed in controls. The genes in a pair demonstrate an inverse relationship with respect to one another. Using TSPs not only acts as a feature selection step but also provides a nonparametric method that transforms the continuous expression data to binary (0, 1) values based on the rank within each pair. Due to the robust nature of the transformed data, our methods demonstrate that the use of TSP binary data is much more effective in prediction than continuous data, particularly in cross-study prediction. Furthermore, we extend the use of TSPs to not only binary and multi-class label prediction, but also continuous classification. The objective of this paper is to demonstrate how using dichotomized data from TSPs as the feature space for machine learning methods, particularly random forest, returns stronger prediction accuracy across independent studies than traditional machine learning techniques with log2 and quantile normalization of data. This work has significant public health impact, as accurate genomic prediction is crucial for early detection of many serious illnesses such as cancer.
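The transformation of continuous expression values into 0/1 pair features can be sketched as follows. This is a minimal illustration, not the rfTSP code: the function name `binarize`, the choice of encoding 1 as "first gene above second", and the toy values are assumptions. The second call illustrates why such features are plausibly robust across studies: a monotone transform such as log2 leaves them unchanged.

```python
from math import log2

def binarize(sample, pairs):
    """Map one expression profile to 0/1 features: 1 if gene i is above gene j."""
    return [int(sample[i] > sample[j]) for i, j in pairs]

# Hypothetical selected pairs and one sample on a raw intensity scale.
pairs = [(0, 1), (1, 2)]
raw = [8.0, 2.0, 4.0]
logged = [log2(x) for x in raw]  # same sample after a monotone transform

print(binarize(raw, pairs))
print(binarize(logged, pairs))
```

The resulting binary vectors would then form the feature space for a downstream learner such as a random forest.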

D-Scholarship@Pitt

A Pairwise Feature Selection Method For Gene Data Using Information Gain

Author: Gui Tian
Publication venue: eGrove
Publication date: 01/01/2014

The current technical practice for classification has limitations when applied to gene expression microarray data. For example, the robustness of top scoring pairs does not extend to some datasets of small size, and the gene set with the best discrimination power may not involve a combination of genes. Hence, it is necessary to construct a discriminative and stable classifier that generates highly informative gene sets. Not all features are active in a given biological process, so a good feature selector should be robust with respect to noise and outliers; the challenge is to select the most informative genes. In this study, the top discriminating pair (TDP) approach is motivated by this issue and aims to reveal which features are highly ranked according to their discrimination power. To identify TDPs, each pair of genes is assigned a score based on their relative probability distribution. Our experiment combines the TDP methodology with information gain (IG) to achieve an effective feature set. To illustrate the effectiveness of TDP with IG, we applied this method to two breast cancer datasets (Wang et al., 2005 and Van't Veer et al., 2002). The result on these experimental datasets using the TDP method is competitive with the baseline method using random forests. Information gain combined with the TDP algorithm used in this study provides a new effective method for feature selection for machine learning.
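For readers unfamiliar with information gain, the following sketch shows the standard definition for a discrete feature: the entropy of the class labels minus the weighted entropy after splitting on the feature. It is a generic illustration; how the thesis combines this quantity with TDP scores is not specified in the abstract, and the function names and toy data here are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(feature, labels):
    """Reduction in label entropy obtained by splitting on a discrete feature."""
    n = len(labels)
    gain = entropy(labels)
    for value in set(feature):
        subset = [y for x, y in zip(feature, labels) if x == value]
        gain -= len(subset) / n * entropy(subset)
    return gain

labels = [0, 0, 1, 1]
print(information_gain([0, 0, 1, 1], labels))  # feature separates the classes
print(information_gain([0, 1, 0, 1], labels))  # feature carries no class information
```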

eGrove (Univ. of Mississippi)

The influence of feature selection methods on accuracy, stability and interpretability of molecular signatures

Author: A Ivshina
Anne-Claire Haury
C Ambroise
C Fan
C Lai
C Sotiriou
C Sotiriou
F Reyal
G Abraham
H Zou
I Guyon
I Guyon
J Bi
J Mairal
J Wang
Jean-Philippe Vert
JPA Ioannidis
L Ein-Dor
L Ein-Dor
M Dai
Muy-Teck Teh
N Meinshausen
P Wirapati
Pierre Gestraud
R Kohavi
R Shen
R Simon
R Tibshirani
RA Irizarry
S Michiels
T Abeel
T Barrett
T Iwamoto
W Shi
Y Benjamini
Y Pawitan
Y Wang
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 23/06/2011

Motivation: Biomarker discovery from high-dimensional data is a crucial problem with enormous applications in biology and medicine. It is also extremely challenging from a statistical viewpoint, but surprisingly few studies have investigated the relative strengths and weaknesses of the plethora of existing feature selection methods. Methods: We compare 32 feature selection methods on 4 public gene expression datasets for breast cancer prognosis, in terms of predictive performance, stability and functional interpretability of the signatures they produce. Results: We observe that the feature selection method has a significant influence on the accuracy, stability and interpretability of signatures. Simple filter methods generally outperform more complex embedded or wrapper methods, and ensemble feature selection has generally no positive effect. Overall a simple Student's t-test seems to provide the best results. Availability: Code and data are publicly available at http://cbio.ensmp.fr/~ahaury/
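A simple filter of the kind the abstract reports as performing best can be sketched with a pooled-variance (Student's) two-sample t-statistic computed per gene. This is an illustrative sketch, not the authors' published code: the function names and toy data are hypothetical, and it assumes two classes labeled 0/1 with at least two samples each and non-zero pooled variance.

```python
from math import sqrt
from statistics import mean, variance

def t_statistic(a, b):
    """Student's two-sample t-statistic with pooled variance."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled * (1 / na + 1 / nb))

def rank_genes(samples, labels):
    """Return gene indices ordered by decreasing |t| (hypothetical helper)."""
    n_genes = len(samples[0])
    stats = []
    for g in range(n_genes):
        a = [s[g] for s, y in zip(samples, labels) if y == 0]
        b = [s[g] for s, y in zip(samples, labels) if y == 1]
        stats.append(abs(t_statistic(a, b)))
    return sorted(range(n_genes), key=lambda g: stats[g], reverse=True)

# Toy data: gene 0 differs between classes, gene 1 does not.
samples = [[1, 5], [2, 4], [3, 6], [4, 5], [5, 4], [6, 6]]
labels = [0, 0, 0, 1, 1, 1]
print(rank_genes(samples, labels))
```

A signature would then be formed by keeping the top-ranked genes; the number to keep is a tuning choice not fixed by this sketch.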

arXiv.org e-Print Archive

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

HAL Descartes

HAL-MINES ParisTech

Recommended from our members

An Overview of the Use of Neural Networks for Data Mining Tasks

Author: Alberts B
Alpaydin E
Ando T
Blake CL
Bramer MA
Castanheira LG
Han J
Lu H
Mitchell M
Ni X
Quinlan RJ
Rumelhart DE
Shafer JC
Shendure J
Simić D
Stahl F
Steinwart I
Surjandari I
Wei JS
Widrow B
Witten IH
Zaslavsky B
Zhang D
Publication venue: 'Wiley'
Publication date: 01/01/2012

In recent years the area of data mining has experienced considerable demand for technologies that extract knowledge from large and complex data sources. There is substantial commercial interest, as well as research activity, aimed at developing new and improved approaches for extracting information, relationships, and patterns from datasets. Artificial Neural Networks (NN) are popular biologically inspired intelligent methodologies, whose classification, prediction and pattern recognition capabilities have been utilised successfully in many areas, including science, engineering, medicine, business, banking, telecommunication, and many other fields. This paper highlights from a data mining perspective the implementation of NN, using supervised and unsupervised learning, for pattern recognition, classification, prediction and cluster analysis, and focuses the discussion on their usage in bioinformatics and financial data analysis tasks.

Central Archive at the University of Reading

Crossref

Portsmouth University Research Portal (Pure)

Bournemouth University Research Online

Statistical methods for tissue array images - algorithmic scoring and co-training

Author: Knudsen Beatrice
Linden Michael
Randolph Timothy
Wang Pei
Yan Donghui
Publication venue: 'Institute of Mathematical Statistics'
Publication date: 01/01/2011

Recent advances in tissue microarray technology have allowed immunohistochemistry to become a powerful medium-to-high throughput analysis tool, particularly for the validation of diagnostic and prognostic biomarkers. However, as study size grows, the manual evaluation of these assays becomes a prohibitive limitation; it vastly reduces throughput and greatly increases variability and expense. We propose an algorithm - Tissue Array Co-Occurrence Matrix Analysis (TACOMA) - for quantifying cellular phenotypes based on textural regularity summarized by local inter-pixel relationships. The algorithm can be easily trained for any staining pattern, has no sensitive tuning parameters and can report the salient pixels in an image that contribute to its score. Pathologists' input via informative training patches is an important aspect of the algorithm that allows training for any specific marker or cell type. With co-training, the error rate of TACOMA can be reduced substantially for a very small training sample (e.g., with size 30). We give theoretical insights into the success of co-training via thinning of the feature set in a high-dimensional setting when there is "sufficient" redundancy among the features. TACOMA is flexible, transparent and provides a scoring process that can be evaluated with clarity and confidence. In a study based on an estrogen receptor (ER) marker, we show that TACOMA is comparable to, or outperforms, pathologists' performance in terms of accuracy and repeatability.
Comment: Published at http://dx.doi.org/10.1214/12-AOAS543 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org)
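The "local inter-pixel relationships" that TACOMA summarizes are co-occurrence counts of gray levels at a fixed pixel offset. The sketch below builds such a co-occurrence matrix for a small quantized image. It illustrates the general technique only, not TACOMA itself: the function name, the default offset (one pixel to the right), and the toy image are assumptions, and the scoring and co-training steps are not shown.

```python
def cooccurrence(image, dr=0, dc=1, levels=None):
    """Count how often gray level a occurs at (r, c) with level b at (r + dr, c + dc).

    `image` is a list of rows of non-negative integer gray levels.
    """
    if levels is None:
        levels = max(max(row) for row in image) + 1
    counts = [[0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dr, c + dc
            # Skip neighbors that fall outside the image.
            if 0 <= r2 < rows and 0 <= c2 < cols:
                counts[image[r][c]][image[r2][c2]] += 1
    return counts

# Toy 3x3 image with three gray levels (illustrative values only).
image = [[0, 0, 1],
         [0, 0, 1],
         [2, 2, 2]]
for row in cooccurrence(image):
    print(row)
```

Entries of such a matrix, possibly for several offsets, can serve as texture features for a classifier trained on pathologist-selected patches.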

arXiv.org e-Print Archive

CiteSeerX

Crossref