Molecular cancer classification using a meta-sample-based regularized robust coding method
Motivation
Previous studies have demonstrated that machine-learning-based molecular cancer classification using gene expression profiling (GEP) data is promising for the clinical diagnosis and treatment of cancer. Novel classification methods with high efficiency and prediction accuracy are still needed to deal with the high dimensionality and small sample size of typical GEP data. Recently the sparse representation (SR) method has been successfully applied to cancer classification. Nevertheless, its efficiency needs to be improved when analyzing large-scale GEP data.
Results
In this paper we present meta-sample-based regularized robust coding classification (MRRCC), a novel and effective cancer classification technique that combines the idea of the meta-sample-based clustering method with the regularized robust coding (RRC) method. It assumes that the coding residual and the coding coefficient are each independent and identically distributed. Similar to meta-sample-based SR classification (MSRC), MRRCC extracts a set of meta-samples from the training samples and then encodes a testing sample as a sparse linear combination of these meta-samples. The representation fidelity is measured by the l2-norm or l1-norm of the coding residual.
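The meta-sample coding idea can be illustrated with a rough sketch. This is not the authors' MRRCC implementation: it assumes meta-samples are obtained by truncated SVD of each class's training matrix, and it replaces the iteratively reweighted RRC step with plain l2-regularized (ridge) coding. The function names and the parameters `k` and `lam` are hypothetical.

```python
import numpy as np

def extract_meta_samples(X_class, k):
    """Return k meta-samples (columns) for one class via truncated SVD.

    X_class: genes x samples matrix holding the training samples of one class.
    Assumes k <= min(genes, samples in the class).
    """
    U, s, _ = np.linalg.svd(X_class, full_matrices=False)
    return U[:, :k] * s[:k]  # scaled left singular vectors, genes x k

def classify(train_X, train_y, test_x, k=2, lam=0.1):
    """Code test_x over all meta-samples with an l2 penalty, then assign the
    class whose meta-samples leave the smallest l2-norm coding residual."""
    classes = np.unique(train_y)
    blocks = [extract_meta_samples(train_X[:, train_y == c], k) for c in classes]
    D = np.hstack(blocks)  # dictionary: genes x (k * number of classes)
    # Ridge coding: alpha = argmin ||x - D a||_2^2 + lam * ||a||_2^2
    alpha = np.linalg.solve(D.T @ D + lam * np.eye(D.shape[1]), D.T @ test_x)
    residuals = []
    for i in range(len(classes)):
        a_c = alpha[i * k:(i + 1) * k]  # coefficients belonging to class i
        residuals.append(np.linalg.norm(test_x - blocks[i] @ a_c))
    return classes[int(np.argmin(residuals))]
```

Because the dictionary holds only `k` columns per class rather than every training sample, the linear system solved per test sample stays small, which is the efficiency argument the abstract makes for meta-samples.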
Conclusions
Extensive experiments on publicly available GEP datasets demonstrate that the proposed method is more efficient, while its prediction accuracy is equivalent to that of existing MSRC-based methods and better than that of other state-of-the-art dimension-reduction-based methods.
This article was funded by the National Science Foundation of China on finding tumor-related driver pathway with comprehensive analysis method based on next-generation sequencing data and the dimension reduction of gene expression data based on heuristic method (grant nos. 61474267, 60973153 and 61133010) and the National Institutes of Health (NIH) Grant P01 AG12993 (PI: E. Michaelis).
This article has been published as part of BMC Bioinformatics Volume 15 Supplement 15, 2014: Proceedings of the 2013 International Conference on Intelligent Computing (ICIC 2013). The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/15/S15
PLS dimension reduction for classification of microarray data
PLS dimension reduction is known to give good prediction accuracy in the context of classification with high-dimensional microarray data. In this paper, PLS is compared with some of the best state-of-the-art classification methods. In addition, a simple procedure to choose the number of components is suggested. The connection between PLS dimension reduction and gene selection is examined and a property of the first PLS component for binary classification is proven. PLS can also be used as a visualization tool for high-dimensional data in the classification framework. The whole study is based on 9 real microarray cancer data sets.
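As a small illustration of how the first PLS component behaves for a binary response, the sketch below computes the first PLS weight vector as the normalized covariance between the centered genes and the centered class label. It then checks a standard algebraic fact, which is not necessarily the property proven in the paper: this weight vector is proportional to the difference of the two class means. The data are synthetic and the function name is hypothetical.

```python
import numpy as np

def first_pls_component(X, y):
    """First PLS weight vector and scores for a binary response y in {0, 1}.

    X: samples x genes. Returns (w, t) with w of unit norm and t = Xc @ w.
    """
    Xc = X - X.mean(axis=0)   # center each gene
    yc = y - y.mean()         # center the class label
    w = Xc.T @ yc             # covariance direction between genes and label
    w = w / np.linalg.norm(w)
    return w, Xc @ w

# Synthetic example: 20 samples, 100 "genes", five of them shifted in class 1.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 100))
y = np.array([0] * 12 + [1] * 8)
X[y == 1, :5] += 2.0

w, t = first_pls_component(X, y)
mean_diff = X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0)
# w points in the same direction as the class-mean difference.
print(np.allclose(w, mean_diff / np.linalg.norm(mean_diff)))
```

Genes with large absolute weight in `w` are those whose class means differ most, which gives one way to read the link between PLS dimension reduction and gene selection.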
Elephant Search with Deep Learning for Microarray Data Analysis
Even though there is a plethora of research on microarray gene expression
data analysis, it remains challenging for researchers to analyze large and
complex gene expression data effectively and efficiently. Feature (gene)
selection is of paramount importance for understanding the differences in
biological and non-biological variation between samples. To address this
problem, a novel elephant search (ES) based optimization is proposed to
select the best gene expressions from the large volume of microarray data.
Further, a machine learning method is envisioned to leverage such
high-dimensional and complex microarray datasets, extracting the hidden
patterns they contain to make meaningful predictions and accurate
classifications. In particular, stochastic gradient descent based deep
learning (DL) with a softmax activation function is used on the reduced
features (genes) to better classify samples according to their gene
expression levels. The experiments are carried out on nine of the most
popular cancer microarray gene selection datasets, obtained from the UCI
machine learning repository. The empirical results obtained by the proposed
elephant search based deep learning (ESDL) approach are compared with
recently published results to assess its suitability for future
bioinformatics research.
Comment: 12 pages, 5 Tabl
A transfer-learning approach to feature extraction from cancer transcriptomes with deep autoencoders
Published in Lecture Notes in Computer Science.
The diagnosis and prognosis of cancer are among the more challenging tasks
that oncology medicine deals with. With the main aim of fitting the most
appropriate treatments, current personalized medicine focuses on using data
from heterogeneous sources to estimate the evolution of a given disease for
the particular case of a certain patient. In recent years, next-generation
sequencing data have boosted cancer prediction by supplying gene-expression
information that has allowed diverse machine learning algorithms to supply
valuable solutions to the problem of cancer subtype classification, which
has surely contributed to better estimation of the patient's response to
diverse treatments. However, the efficacy of these models is seriously
affected by the existing imbalance between the high dimensionality of the
gene expression feature sets and the number of samples available for a
particular cancer type. To counteract what is known as the curse of
dimensionality, feature selection and extraction methods have been
traditionally applied to reduce the number of input variables present in
gene expression datasets. Although these techniques work by scaling down the
input feature space, the prediction performance of traditional machine
learning pipelines using these feature reduction strategies remains
moderate. In this work, we propose the use of the Pan-Cancer dataset to
pre-train deep autoencoder architectures on a subset composed of thousands
of gene expression samples of very diverse tumor types. The resulting
architectures are subsequently fine-tuned on a collection of specific breast
cancer samples. This transfer-learning approach aims at combining supervised
and unsupervised deep learning models with traditional machine learning
classification algorithms to tackle the problem of breast tumor
intrinsic-subtype classification.
Universidad de Málaga. Campus de Excelencia Internacional Andalucía Tech
High-dimensional classification using features annealed independence rules
Classification using high-dimensional features arises frequently in many
contemporary statistical studies such as tumor classification using microarray
or other high-throughput data. The impact of dimensionality on classifications
is poorly understood. In a seminal paper, Bickel and Levina [Bernoulli 10
(2004) 989--1010] show that the Fisher discriminant performs poorly due to
diverging spectra and they propose to use the independence rule to overcome the
problem. We first demonstrate that even for the independence classification
rule, classification using all the features can be as poor as random
guessing due to noise accumulation in estimating population centroids in
high-dimensional feature space. In fact, we demonstrate further that almost all
linear discriminants can perform as poorly as random guessing. Thus, it is
important to select a subset of important features for high-dimensional
classification, resulting in Features Annealed Independence Rules (FAIR). The
conditions under which all the important features can be selected by the
two-sample t-statistic are established. The choice of the optimal number of
features, or equivalently, the threshold value of the test statistic, is
proposed based on an upper bound of the classification error. Simulation
studies and real data analysis support our theoretical results and demonstrate
convincingly the advantage of our new classification procedure.
Comment: Published at http://dx.doi.org/10.1214/07-AOS504 in the Annals of
Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical
Statistics (http://www.imstat.org
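The two ingredients named in the abstract, ranking features by the two-sample t-statistic and classifying with an independence (diagonal covariance) rule, can be sketched as follows. This is a simplified illustration rather than the authors' FAIR procedure: the number of retained features `m` is passed in by hand instead of being chosen from the error bound, and all function names are hypothetical.

```python
import numpy as np

def two_sample_t(X, y):
    """Two-sample t-statistic (unequal variances) for every feature.

    X: samples x features; y: binary labels in {0, 1}.
    """
    X0, X1 = X[y == 0], X[y == 1]
    n0, n1 = len(X0), len(X1)
    num = X1.mean(axis=0) - X0.mean(axis=0)
    den = np.sqrt(X1.var(axis=0, ddof=1) / n1 + X0.var(axis=0, ddof=1) / n0)
    return num / den

def fit_fair_like(X, y, m):
    """Keep the m features with the largest |t| and fit an independence rule."""
    t = two_sample_t(X, y)
    keep = np.argsort(-np.abs(t))[:m]
    X0, X1 = X[y == 0][:, keep], X[y == 1][:, keep]
    n0, n1 = len(X0), len(X1)
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    # Pooled per-feature variance; off-diagonal covariances are ignored.
    var = ((n0 - 1) * X0.var(axis=0, ddof=1)
           + (n1 - 1) * X1.var(axis=0, ddof=1)) / (n0 + n1 - 2)
    return keep, mu0, mu1, var

def predict(model, X):
    """Assign class 1 when the diagonal linear discriminant score is positive."""
    keep, mu0, mu1, var = model
    Z = X[:, keep]
    score = ((Z - (mu0 + mu1) / 2) * (mu1 - mu0) / var).sum(axis=1)
    return (score > 0).astype(int)
```

Setting `m` to the total number of features recovers the plain independence rule on all features, the setting in which the abstract reports that noise accumulation can push performance toward random guessing.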
Extreme Value Distribution Based Gene Selection Criteria for Discriminant Microarray Data Analysis Using Logistic Regression
One important issue commonly encountered in the analysis of microarray data
is deciding which and how many genes should be selected for further study.
For discriminant microarray data analyses based on statistical models, such as
logistic regression models, gene selection can be accomplished by comparing
the maximum likelihood of the model given the real data with the expected
maximum likelihood of the model given an ensemble of surrogate data with
randomly permuted labels. Typically, the computational burden of obtaining
the latter is immense, often exceeding the limits of available computing
resources by orders of magnitude. Here, we propose an approach that
circumvents such heavy computations by mapping the simulation problem to an
extreme-value problem. We present the derivation of an asymptotic
distribution of the extreme value as well as its mean, median, and variance.
Using this distribution, we propose two gene selection criteria, and we apply
them to two microarray datasets and three classification tasks for
illustration.
Comment: to be published in Journal of Computational Biology (2004