An efficient statistical feature selection approach for classification of gene expression data
Abstract: Classification of gene expression data plays a significant role in the prediction and diagnosis of diseases. Gene expression data have a special characteristic: the number of genes vastly exceeds the number of samples. Not all genes contribute to efficient classification of samples, so a robust feature selection algorithm is required to identify the important genes that help classify the samples efficiently. Many feature selection algorithms have been introduced in the past to select informative genes (features) based on relevance and redundancy characteristics. Most of the earlier algorithms require a computationally expensive search strategy to find an optimal feature subset, and existing feature selection methods are also sensitive to the evaluation measures. This paper introduces a novel and efficient feature selection approach based on a statistically defined effective range of features for every class, termed ERGS (Effective Range based Gene Selection). The basic principle behind ERGS is that a higher weight is given to a feature that discriminates the classes clearly. Experimental results on well-known gene expression datasets illustrate the effectiveness of the proposed approach. Two popular classifiers, viz. the Naïve Bayes Classifier (NBC) and the Support Vector Machine (SVM), have been used for classification. The proposed feature selection algorithm is helpful in ranking genes and is capable of identifying the most relevant genes responsible for diseases such as leukemia, colon tumor, lung cancer, diffuse large B-cell lymphoma (DLBCL), and prostate cancer.
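The core idea of ERGS — score each gene by how little its per-class "effective ranges" overlap, so that clearly discriminating genes receive higher weight — can be sketched as follows. This is a simplified illustration of the principle, not the authors' exact formulation; the function name, the `gamma` width parameter, and the overlap normalization are my own assumptions.

```python
import numpy as np

def effective_range_weights(X, y, gamma=1.732):
    """Score each feature by how cleanly its per-class effective
    ranges (mean +/- gamma * std) separate the classes.
    Simplified sketch of the ERGS idea; the published algorithm's
    exact overlap and weighting terms may differ."""
    classes = np.unique(y)
    weights = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        # Effective range [lo, hi] of feature j for each class.
        ranges = []
        for c in classes:
            v = X[y == c, j]
            ranges.append((v.mean() - gamma * v.std(),
                           v.mean() + gamma * v.std()))
        # Total pairwise overlap between the class ranges.
        overlap = 0.0
        for a in range(len(ranges)):
            for b in range(a + 1, len(ranges)):
                lo = max(ranges[a][0], ranges[b][0])
                hi = min(ranges[a][1], ranges[b][1])
                overlap += max(0.0, hi - lo)
        span = max(r[1] for r in ranges) - min(r[0] for r in ranges)
        # Less overlap relative to the total span -> higher weight.
        weights[j] = 1.0 - overlap / span if span > 0 else 0.0
    return weights
```

Ranking the genes by these weights and keeping the top-k then yields the reduced feature set fed to a downstream classifier such as NBC or SVM.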
Machine Learning and Integrative Analysis of Biomedical Big Data.
Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) is analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges and exacerbates those associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: the curse of dimensionality, data heterogeneity, missing data, class imbalance, and scalability issues.
Elephant Search with Deep Learning for Microarray Data Analysis
Even though there is a plethora of research in microarray gene expression data analysis, it still poses challenges for researchers to effectively and efficiently analyze the large yet complex expression of genes. The feature (gene) selection method is of paramount importance for understanding the differences in biological and non-biological variation between samples. In order to address this problem, a novel elephant search (ES) based optimization is proposed to select the best gene expressions from the large volume of microarray data. Further, a promising machine learning method is envisioned to leverage such high-dimensional and complex microarray datasets, extracting the hidden patterns inside them to make meaningful predictions and accurate classifications. In particular, stochastic gradient descent based deep learning (DL) with a softmax activation function is then used on the reduced features (genes) for better classification of different samples according to their gene expression levels. The experiments are carried out on the nine most popular cancer microarray gene selection datasets, obtained from the UCI Machine Learning Repository. The empirical results obtained by the proposed elephant search based deep learning (ESDL) approach are compared with the most recently published articles to assess its suitability for future bioinformatics research. Comment: 12 pages, 5 Tables
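The second stage described above — an SGD-trained network with a softmax output layer applied to the reduced gene set — can be illustrated with a minimal softmax (multinomial logistic) classifier trained by per-sample stochastic gradient descent. This is a single-layer stand-in for the deeper network the abstract describes, with hyperparameters (`lr`, `epochs`) chosen only for the sketch.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def sgd_softmax_train(X, y, n_classes, lr=0.05, epochs=50, seed=0):
    """Softmax classifier trained with plain stochastic gradient
    descent on one sample at a time (cross-entropy loss)."""
    rng = np.random.default_rng(seed)
    W = np.zeros((X.shape[1], n_classes))
    b = np.zeros(n_classes)
    for _ in range(epochs):
        for i in rng.permutation(len(X)):
            p = softmax(X[i:i + 1] @ W + b)
            p[0, y[i]] -= 1.0            # gradient of cross-entropy w.r.t. logits
            W -= lr * X[i:i + 1].T @ p
            b -= lr * p[0]
    return W, b

def predict(W, b, X):
    """Assign each sample to the class with the largest logit."""
    return np.argmax(X @ W + b, axis=1)
```

In the ESDL pipeline, `X` would hold only the gene columns retained by the elephant search stage; here the classifier is shown in isolation.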
Extreme Value Distribution Based Gene Selection Criteria for Discriminant Microarray Data Analysis Using Logistic Regression
One important issue commonly encountered in the analysis of microarray data is to decide which and how many genes should be selected for further studies. For discriminant microarray data analyses based on statistical models, such as logistic regression models, gene selection can be accomplished by comparing the maximum likelihood of the model given the real data with the expected maximum likelihood of the model given an ensemble of surrogate data with randomly permuted labels. Typically, the computational burden for obtaining the latter is immense, often exceeding the limits of available computing resources by orders of magnitude. Here, we propose an approach that circumvents such heavy computations by mapping the simulation problem to an extreme-value problem. We present the derivation of an asymptotic distribution of the extreme value as well as its mean, median, and variance. Using this distribution, we propose two gene selection criteria, and we apply them to two microarray datasets and three classification tasks for illustration. Comment: to be published in Journal of Computational Biology (2004)
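The computational shortcut described above — replacing an expensive simulated ensemble with an asymptotic extreme-value (Gumbel-type) result for its maximum — can be illustrated on a toy case where the answer is known in closed form: the maximum of n i.i.d. Exp(1) variates has expectation equal to the n-th harmonic number, while the extreme-value asymptotics give ln(n) plus the Euler–Mascheroni constant. This demonstrates the principle only; the paper derives the distribution for its own likelihood statistic, not for exponentials.

```python
import numpy as np

EULER_GAMMA = 0.5772156649015329  # Euler–Mascheroni constant

def expected_max_exponential(n):
    """Exact E[max of n i.i.d. Exp(1)] = n-th harmonic number H_n
    (the 'brute force' answer one would otherwise simulate)."""
    return sum(1.0 / k for k in range(1, n + 1))

def gumbel_approx_expected_max(n):
    """Extreme-value asymptotics: max of n i.i.d. Exp(1) is
    approximately ln(n) + Gumbel(0, 1), whose mean is EULER_GAMMA,
    so E[max] ~ ln(n) + EULER_GAMMA -- no ensemble needed."""
    return np.log(n) + EULER_GAMMA
```

For n = 10,000 the two quantities agree to within about 1/(2n), which is the spirit of the paper's shortcut: the expected maximum over a huge surrogate ensemble comes from a formula rather than from simulation.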
Effect Size Estimation and Misclassification Rate Based Variable Selection in Linear Discriminant Analysis
Supervised classification of biological samples based on genetic information (e.g., gene expression profiles) is an important problem in biostatistics. In order to find classification rules that are both accurate and interpretable, variable selection is indispensable. This article explores how an assessment of the individual importance of variables (effect size estimation) can be used to perform variable selection. I review recent effect size estimation approaches in the context of linear discriminant analysis (LDA) and propose a new effect size estimation method that is conceptually simple and at the same time computationally efficient. I then show how to use effect sizes to perform variable selection based on the misclassification rate, which is the data-independent expectation of the prediction error. Simulation studies and real data analyses illustrate that the proposed effect size estimation and variable selection methods are competitive. In particular, they lead to both compact and interpretable feature sets. Comment: 21 pages, 2 figures
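The general recipe the abstract describes — estimate a per-variable effect size, then keep only the variables with the largest effects — can be sketched with a generic two-class standardized mean difference. This is a textbook effect-size score used purely for illustration; the article proposes its own estimator and a misclassification-rate-based stopping rule, neither of which is reproduced here.

```python
import numpy as np

def effect_sizes(X, y):
    """Per-feature standardized mean difference between two classes
    (a generic effect-size score, not the article's estimator)."""
    X0, X1 = X[y == 0], X[y == 1]
    pooled_sd = np.sqrt((X0.var(axis=0, ddof=1) +
                         X1.var(axis=0, ddof=1)) / 2.0)
    return np.abs(X0.mean(axis=0) - X1.mean(axis=0)) / pooled_sd

def select_top(X, y, k):
    """Keep the indices of the k features with the largest effect
    sizes -- the ranking step that precedes fitting the LDA rule."""
    order = np.argsort(effect_sizes(X, y))[::-1]
    return np.sort(order[:k])
```

An LDA classifier would then be fit on `X[:, select_top(X, y, k)]`, with k chosen here by hand rather than by the article's misclassification-rate criterion.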