2 research outputs found

A general approach to simultaneous model fitting and variable elimination in response models for biological data with many more variables than observations

Author: A Dempster
A Spira
B Schölkopf
C Ambroise
DA Hinds
DR Cox
GN Watson
Harri T Kiiveri
HT Kiiveri
I Guyon
JA Nelder
JC Platt
JX Zhu
L Breiman
M Abramowitz
M Figueiredo
ME Ross
MY Park
P McCullagh
R Tibshirani
RDC Team
S Kotz
S Zhang
SA Tomlins
SS Dave
SS Keerthi
T Zhang
Publication venue: BioMed Central
Publication date: 01/01/2008
Abstract

Background: With the advent of high-throughput biotechnology data acquisition platforms such as microarrays, SNP chips and mass spectrometers, data sets with many more variables than observations are now routinely being collected. Finding relationships between response variables of interest and variables in such data sets is an important problem akin to finding needles in a haystack. Whilst methods for a number of response types have been developed, a general approach has been lacking.

Results: The major contribution of this paper is to present a unified methodology which allows many common (statistical) response models to be fitted to such data sets. The class of models includes virtually any model with a linear predictor in it, for example (but not limited to) multiclass logistic regression (classification), generalised linear models (regression) and survival models. A fast algorithm for finding sparse, well-fitting models is presented. The ideas are illustrated on real data sets with numbers of variables ranging from thousands to millions. R code implementing the ideas is available for download.

Conclusion: The method described in this paper enables existing work on response models for the case of fewer variables than observations to be extended to the situation where there are many more variables than observations. It is a powerful approach to finding parsimonious models for such data sets. The method is capable of handling problems with millions of variables and a large variety of response types within a single framework. The method compares favourably to existing methods such as support vector machines and random forests, but has the advantage of not requiring separate variable selection steps. It also works for data types which these methods were not designed to handle. The method usually produces very sparse models, which makes biological interpretation simpler and more focused.

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Datamining approaches for modeling tumor control probability

Crossref