Ensembles based on Random Projection for gene expression data analysis
In this work we focus on methods for solving classification problems characterized
by high dimensionality and low cardinality, that is, many features and few samples.
Such data are typical of bio-molecular data analysis, and particularly of class
prediction with microarray data.
Many methods have been proposed to address this problem, which is subject to
the so-called curse of dimensionality (a term introduced by Richard Bellman (9)).
Among them are gene selection methods, principal and independent component analysis,
and kernel methods.
In this work we propose and experimentally analyze two ensemble methods
based on two randomized techniques for data compression: Random Subspaces
and Random Projections. While Random Subspaces, originally proposed by T.
K. Ho, is a feature subsampling technique, Random Projections is a feature
extraction technique motivated by the Johnson-Lindenstrauss theory of
distance-preserving random projections.
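The difference between the two compression techniques can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the Gaussian projection matrix with entries of variance 1/d, and the synthetic data are all assumptions chosen for illustration.

```python
import math
import random


def random_subspace(X, d, rng):
    """Feature subsampling: keep d randomly chosen original features."""
    n_features = len(X[0])
    idx = sorted(rng.sample(range(n_features), d))
    return [[row[j] for j in idx] for row in X]


def random_projection(X, d, rng):
    """Feature extraction: map each row to d dimensions with a random matrix.

    Entries are drawn from N(0, 1/d) (an assumed, common choice), so squared
    Euclidean distances are preserved in expectation.
    """
    n_features = len(X[0])
    scale = 1.0 / math.sqrt(d)
    R = [[rng.gauss(0.0, 1.0) * scale for _ in range(d)]
         for _ in range(n_features)]
    return [[sum(row[i] * R[i][k] for i in range(n_features))
             for k in range(d)]
            for row in X]


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


rng = random.Random(0)
# Synthetic stand-in for expression data: 5 samples, 500 features.
X = [[rng.gauss(0.0, 1.0) for _ in range(500)] for _ in range(5)]

X_rs = random_subspace(X, 100, rng)    # 100 of the original features
X_rp = random_projection(X, 100, rng)  # 100 new, mixed features

# Ratio of projected to original distance; values near 1 mean low distortion.
ratio = euclidean(X_rp[0], X_rp[1]) / euclidean(X[0], X[1])
print(round(ratio, 3))
```

In this sketch Random Subspaces returns a subset of the original columns, whereas Random Projections returns new features that are random linear combinations of all of them.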
The randomness underlying the proposed approach leads to diverse sets of extracted
features corresponding to low-dimensional subspaces with low metric distortion
and approximate preservation of the expected loss of the trained base
classifiers.
In the first part of the work we justify our approach with two theoretical results.
The first concerns unsupervised learning: we prove that a clustering algorithm minimizing
the (quadratic) objective function provides an ε-closed solution if applied
to data compressed according to Johnson-Lindenstrauss theory.
The second concerns supervised learning: we prove that polynomial kernels
are approximately preserved by Random Projections, up to a degradation proportional to the square of the degree of the polynomial.
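As a rough empirical illustration of the quantity involved (not a verification of the theoretical bound), one can compare a polynomial kernel value before and after a random projection for several degrees. The kernel form (1 + <x, y>)^p, the Gaussian projection matrix, and all names below are assumptions made for this sketch.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def project(v, R):
    # R holds d rows of length len(v); the result has d coordinates.
    return [dot(row, v) for row in R]


def poly_kernel(u, v, degree):
    # Assumed kernel form: (1 + <u, v>) ** degree.
    return (1.0 + dot(u, v)) ** degree


rng = random.Random(1)
n, d = 1000, 200

# Two positively correlated unit vectors in the original space.
x = normalize([rng.gauss(0.0, 1.0) for _ in range(n)])
noise = normalize([rng.gauss(0.0, 1.0) for _ in range(n)])
y = normalize([a + b for a, b in zip(x, noise)])

# Gaussian projection matrix with entries of variance 1/d.
R = [[rng.gauss(0.0, 1.0) / math.sqrt(d) for _ in range(n)] for _ in range(d)]
px, py = project(x, R), project(y, R)

# Relative error of the kernel value for increasing polynomial degree.
rel_err = {}
for p in (1, 2, 4, 8):
    exact = poly_kernel(x, y, p)
    rel_err[p] = abs(poly_kernel(px, py, p) - exact) / exact

for p, err in rel_err.items():
    print(p, round(err, 4))
```

For a fixed projection the relative error grows with the degree, because the same perturbation of the inner product is raised to a higher power.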
In the second part of the work, we propose ensemble algorithms based on Random
Subspaces and Random Projections, and we experimentally compare them
with single SVMs and other state-of-the-art ensemble methods, using three gene
expression data sets: Colon, Leukemia and DLBL-FL (Diffuse Large B-cell
and Follicular Lymphoma). The results obtained confirm the effectiveness of the
proposed approach.
Moreover, we observed some performance degradation of Random Projection
methods when the base learners are SVMs with high-degree polynomial kernels.
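The general shape of a Random Projection ensemble can be sketched as follows. To stay dependency-free, this sketch swaps the SVM base learners used in the work for a simple nearest-centroid classifier and combines them by majority vote; the synthetic data, the parameter values, and all function names are assumptions, not the authors' algorithm.

```python
import math
import random


def make_projection(n_features, d, rng):
    """Gaussian random matrix (d rows) with entries of variance 1/d."""
    scale = 1.0 / math.sqrt(d)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(n_features)]
            for _ in range(d)]


def project(x, R):
    return [sum(r * a for r, a in zip(row, x)) for row in R]


def centroid(rows):
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def fit_ensemble(X, y, d, n_learners, rng):
    """Train one nearest-centroid learner per independent random projection."""
    ensemble = []
    for _ in range(n_learners):
        R = make_projection(len(X[0]), d, rng)
        Z = [project(x, R) for x in X]
        centroids = {label: centroid([z for z, t in zip(Z, y) if t == label])
                     for label in set(y)}
        ensemble.append((R, centroids))
    return ensemble


def predict(ensemble, x):
    """Majority vote over the base learners."""
    votes = {}
    for R, centroids in ensemble:
        z = project(x, R)
        label = min(centroids, key=lambda c: sq_dist(z, centroids[c]))
        votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get)


def sample(label, rng, n_features=200, n_informative=40, shift=3.0):
    """Synthetic sample: class 1 is shifted on the first informative features."""
    x = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
    if label == 1:
        for j in range(n_informative):
            x[j] += shift
    return x


rng = random.Random(0)
y_train = [0] * 20 + [1] * 20
X_train = [sample(t, rng) for t in y_train]
y_test = [0] * 20 + [1] * 20
X_test = [sample(t, rng) for t in y_test]

# An odd number of learners avoids ties in the two-class vote.
ensemble = fit_ensemble(X_train, y_train, d=20, n_learners=11, rng=rng)
accuracy = sum(predict(ensemble, x) == t
               for x, t in zip(X_test, y_test)) / len(y_test)
print(accuracy)
```

A Random Subspace ensemble would follow the same structure, with each learner trained on a random subset of the original features instead of a projected copy of the data.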
Random matrices in data analysis
Abstract. We show how carefully crafted random matrices can achieve distance-preserving dimensionality reduction, accelerate spectral computations, and reduce the sample complexity of certain kernel methods.