5 research outputs found

    A general approach to simultaneous model fitting and variable elimination in response models for biological data with many more variables than observations

    Abstract

    Background

    With the advent of high-throughput biotechnology data acquisition platforms such as microarrays, SNP chips and mass spectrometers, data sets with many more variables than observations are now routinely being collected. Finding relationships between response variables of interest and variables in such data sets is an important problem akin to finding needles in a haystack. Whilst methods for a number of response types have been developed, a general approach has been lacking.

    Results

    The major contribution of this paper is to present a unified methodology which allows many common (statistical) response models to be fitted to such data sets. The class of models includes virtually any model with a linear predictor in it, for example (but not limited to) multiclass logistic regression (classification), generalised linear models (regression) and survival models. A fast algorithm for finding sparse, well-fitting models is presented. The ideas are illustrated on real data sets with numbers of variables ranging from thousands to millions. R code implementing the ideas is available for download.

    Conclusion

    The method described in this paper enables existing work on response models for the case of fewer variables than observations to be extended to the situation where there are many more variables than observations. It is a powerful approach to finding parsimonious models for such data sets. The method is capable of handling problems with millions of variables and a large variety of response types within a single framework. The method compares favourably to existing methods such as support vector machines and random forests, but has the advantage of not requiring separate variable selection steps. It also works for data types which these methods were not designed to handle. The method usually produces very sparse models, which makes biological interpretation simpler and more focused.

    Mapping Salinity Using Decision Trees and Conditional Probabilistic Networks

    This paper examines the use of different classifiers for integrating multi-temporal remotely sensed data with landform data derived from digital elevation models to produce maps showing areas affected by salinity in the south west agricultural region of Western Australia. Decision trees are used to map saline areas in the Ryan's Brook catchment, located approximately 50 kilometres southwest of Kojonup, WA. The results are compared with maximum likelihood classification techniques using single-date Landsat TM imagery. The non-parametric decision tree classifiers combine multi-temporal Landsat TM data with landform data derived from digital elevation models to produce more accurate salinity maps. However, the maps exhibited large amounts of noise and showed errors that might be reduced by incorporating prior knowledge about the relationships among the input attributes and between those attributes and salinity.

    Categorical Causal Modeling
