Elephant Search with Deep Learning for Microarray Data Analysis
Although there is a plethora of research on microarray gene expression
data analysis, it remains challenging for researchers to analyze large and
complex gene expression data effectively and efficiently. The feature
(gene) selection method is of paramount importance for understanding the
differences in biological and non-biological variation between samples. In
order to address this problem, a novel elephant search (ES) based optimization
is proposed to select the best gene expressions from the large volume of microarray
data. Further, a promising machine learning method is envisioned to leverage
such high-dimensional and complex microarray datasets, extracting hidden
patterns to make meaningful predictions and the most accurate
classification. In particular, stochastic gradient descent based deep learning
(DL) with a softmax activation function is then used on the reduced features
(genes) for better classification of different samples according to their gene
expression levels. The experiments are carried out on nine of the most popular cancer
microarray gene selection datasets, obtained from the UCI Machine Learning
Repository. The empirical results obtained by the proposed elephant search
based deep learning (ESDL) approach are compared with the most recently published
article to assess its suitability for future bioinformatics research.
Comment: 12 pages, 5 Tabl
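The classification step described above (stochastic gradient descent with a softmax output on the reduced gene set) can be illustrated with a minimal sketch. This is not the ESDL model: it is a single softmax layer rather than a deep network, the elephant search step is omitted, and the toy data, learning rate, and function names are assumptions made for illustration only.

```python
import math
import random


def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def class_scores(W, b, x):
    """Linear score of sample x for every class."""
    return [sum(w * xi for w, xi in zip(W[c], x)) + b[c] for c in range(len(W))]


def train_softmax_sgd(X, y, n_classes, lr=0.1, epochs=200, seed=0):
    """Fit one softmax layer with per-sample stochastic gradient descent."""
    rng = random.Random(seed)
    n_features = len(X[0])
    W = [[0.0] * n_features for _ in range(n_classes)]
    b = [0.0] * n_classes
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit samples in a new random order each epoch
        for i in order:
            probs = softmax(class_scores(W, b, X[i]))
            for c in range(n_classes):
                # gradient of the cross-entropy loss w.r.t. the score of class c
                grad = probs[c] - (1.0 if c == y[i] else 0.0)
                for j in range(n_features):
                    W[c][j] -= lr * grad * X[i][j]
                b[c] -= lr * grad
    return W, b


def predict(W, b, x):
    """Return the index of the highest-scoring class."""
    scores = class_scores(W, b, x)
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical toy data: two selected genes per sample, two sample classes.
X = [[2.0, 0.1], [1.8, 0.3], [0.2, 1.9], [0.1, 2.2]]
y = [0, 0, 1, 1]
W, b = train_softmax_sgd(X, y, n_classes=2)
predictions = [predict(W, b, x) for x in X]
```

In a real pipeline the rows of `X` would be the expression levels of the genes kept by the feature selection step, and accuracy would be measured on held-out samples rather than on the training data.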
Motif Discovery through Predictive Modeling of Gene Regulation
We present MEDUSA, an integrative method for learning motif models of
transcription factor binding sites by incorporating promoter sequence and gene
expression data. We use a modern large-margin machine learning approach, based
on boosting, to enable feature selection from the high-dimensional search space
of candidate binding sequences while avoiding overfitting. At each iteration of
the algorithm, MEDUSA builds a motif model whose presence in the promoter
region of a gene, coupled with activity of a regulator in an experiment, is
predictive of differential expression. In this way, we learn motifs that are
functional and predictive of regulatory response rather than motifs that are
simply overrepresented in promoter sequences. Moreover, MEDUSA produces a model
of the transcriptional control logic that can predict the expression of any
gene in the organism, given the sequence of the promoter region of the target
gene and the expression state of a set of known or putative transcription
factors and signaling molecules. Each motif model is either a k-length
sequence, a dimer, or a PSSM that is built by agglomerative probabilistic
clustering of sequences with similar boosting loss. By applying MEDUSA to a set
of environmental stress response expression data in yeast, we learn motifs
whose ability to predict differential expression of target genes outperforms
motifs from the TRANSFAC dataset and from a previously published candidate set
of PSSMs. We also show that MEDUSA retrieves many experimentally confirmed
binding sites associated with environmental stress response from the
literature.
Comment: RECOMB 200
Assessing similarity of feature selection techniques in high-dimensional domains
Recent research efforts attempt to combine multiple feature selection techniques instead of using a single one. However, this combination is often made on an “ad hoc” basis, depending on the specific problem at hand, without considering the degree of diversity/similarity of the methods involved. Moreover, though it is recognized that different techniques may return quite dissimilar outputs, especially in high-dimensional/small-sample-size domains, few direct comparisons exist that quantify these differences and their implications for classification performance. This paper aims to contribute in this direction by proposing a general methodology for assessing the similarity between the outputs of different feature selection methods in high-dimensional classification problems. Using the genomics domain as a benchmark, an empirical study has been conducted to compare some of the most popular feature selection methods, and useful insight has been obtained about their pattern of agreement.
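One simple way to quantify how similar the outputs of two feature selection methods are is the Jaccard index of the selected subsets. The abstract does not say which similarity measure the paper uses, so the measure, the method names, and the gene identifiers below are assumptions chosen for illustration.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index of two feature subsets: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty selections are treated as identical
    return len(a & b) / len(a | b)


def pairwise_similarity(selections):
    """Map each pair of method names to the similarity of their selected subsets."""
    return {
        (m1, m2): jaccard(selections[m1], selections[m2])
        for m1, m2 in combinations(sorted(selections), 2)
    }


# Hypothetical top-4 gene lists returned by three selection methods.
selections = {
    "chi2": ["g1", "g2", "g3", "g4"],
    "info_gain": ["g2", "g3", "g4", "g5"],
    "relief": ["g7", "g8", "g1", "g9"],
}
similarity = pairwise_similarity(selections)
```

Here `chi2` and `info_gain` share three of five distinct genes (similarity 0.6), while `info_gain` and `relief` share none (similarity 0.0). A low value suggests two methods are diverse enough that combining them may add information.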
An efficient statistical feature selection approach for classification of gene expression data
Classification of gene expression data plays a significant role in the prediction and diagnosis of diseases. Gene expression data has the special characteristic that the gene dimension is mismatched with the sample dimension. Not all genes contribute to efficient classification of samples. A robust feature selection algorithm is required to identify the important genes that help in classifying the samples efficiently. In order to select informative genes (features) based on relevance and redundancy characteristics, many feature selection algorithms have been introduced in the past. Most of the earlier algorithms require a computationally expensive search strategy to find an optimal feature subset. Existing feature selection methods are also sensitive to the evaluation measures. The paper introduces a novel and efficient feature selection approach, termed ERGS (Effective Range based Gene Selection), based on a statistically defined effective range of features for every class. The basic principle behind ERGS is that higher weight is given to the feature that discriminates the classes clearly. Experimental results on well-known gene expression datasets illustrate the effectiveness of the proposed approach. Two popular classifiers, viz. the Naive Bayes Classifier (NBC) and the Support Vector Machine (SVM), have been used for classification. The proposed feature selection algorithm can be helpful in ranking the genes and is also capable of identifying the most relevant genes responsible for diseases like leukemia, colon tumor, lung cancer, diffuse large B-cell lymphoma (DLBCL), prostate cancer.
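The principle stated above, that a gene gets more weight the more clearly its per-class ranges separate, can be sketched as follows. The abstract does not give the ERGS formula, so this is a simplified illustration and not the published definition: the range is assumed to be mean ± gamma × standard deviation, class priors are ignored, and the class labels and values are hypothetical.

```python
from statistics import mean, pstdev


def effective_range(values, gamma=1.0):
    """Assumed range for one class: mean -/+ gamma * population std deviation."""
    mu, sd = mean(values), pstdev(values)
    return mu - gamma * sd, mu + gamma * sd


def range_overlap_weight(values_by_class, gamma=1.0):
    """Weight in [0, 1]; 1.0 means the class ranges of this gene do not overlap."""
    ranges = [effective_range(v, gamma) for v in values_by_class.values()]
    overlap = 0.0
    for i in range(len(ranges)):
        for j in range(i + 1, len(ranges)):
            lo = max(ranges[i][0], ranges[j][0])
            hi = min(ranges[i][1], ranges[j][1])
            overlap += max(0.0, hi - lo)  # length shared by the two ranges
    span = max(r[1] for r in ranges) - min(r[0] for r in ranges)
    if span == 0:
        return 0.0  # constant gene: it cannot discriminate any class
    # with more than two classes the summed overlap can exceed the span
    return max(0.0, 1.0 - overlap / span)


# Hypothetical expression values of two genes in two sample classes.
gene_a = {"class_1": [1.0, 1.2, 0.8], "class_2": [5.0, 5.2, 4.8]}
gene_b = {"class_1": [1.0, 2.0, 3.0], "class_2": [1.0, 2.0, 3.0]}
weight_a = range_overlap_weight(gene_a)  # well separated classes
weight_b = range_overlap_weight(gene_b)  # identical class distributions
```

Ranking genes by this weight puts `gene_a` ahead of `gene_b`, which matches the intuition that a gene with non-overlapping class ranges is the more useful one for classification.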
Gene Expression based Survival Prediction for Cancer Patients: A Topic Modeling Approach
Cancer is one of the leading causes of death worldwide. Many believe that
genomic data will enable us to better predict the survival time of these
patients, which will lead to better, more personalized treatment options and
patient care. As standard survival prediction models have a hard time coping
with the high dimensionality of such gene expression (GE) data, many projects
use some dimensionality reduction techniques to overcome this hurdle. We
introduce a novel methodology, inspired by topic modeling from the natural
language domain, to derive expressive features from the high-dimensional GE
data. There, a document is represented as a mixture over a relatively small
number of topics, where each topic corresponds to a distribution over the
words; here, to accommodate the heterogeneity of a patient's cancer, we
represent each patient (~document) as a mixture over cancer-topics, where each
cancer-topic is a mixture over GE values (~words). This required some
extensions to the standard LDA model, e.g., to accommodate the "real-valued"
expression values - leading to our novel "discretized" Latent Dirichlet
Allocation (dLDA) procedure. We initially focus on the METABRIC dataset, which
describes breast cancer patients using the r=49,576 GE values, from
microarrays. Our results show that our approach provides survival estimates
that are more accurate than standard models, in terms of the standard
Concordance measure. We then validate this approach by running it on the
Pan-kidney (KIPAN) dataset, over r=15,529 GE values - here using the mRNAseq
modality - and find that it again achieves excellent results. In both cases, we
also show that the resulting model is calibrated, using the recent
"D-calibrated" measure. These successes, in two different cancer types and
expression modalities, demonstrate the generality and effectiveness of
this approach.
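The concordance measure mentioned above scores how well predicted risks order the patients by survival time. The abstract does not specify which variant is used, so the following is a minimal sketch of a Harrell-style concordance index for right-censored data, with hypothetical toy values.

```python
def concordance_index(times, events, risk_scores):
    """Fraction of comparable patient pairs that the risk scores order correctly.

    times: observed follow-up time per patient.
    events: 1 if death was observed, 0 if the patient was censored.
    risk_scores: higher score means shorter predicted survival.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when patient i has an observed event
            # strictly before patient j's observed time.
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # ties in risk count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical toy cohort: the third patient is censored at time 6.
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
c_good = concordance_index(times, events, [0.9, 0.7, 0.5, 0.1])
c_bad = concordance_index(times, events, [0.1, 0.5, 0.7, 0.9])
c_flat = concordance_index(times, events, [0.5, 0.5, 0.5, 0.5])
```

A value of 1.0 means every comparable pair is ordered correctly, 0.5 corresponds to uninformative scores, and 0.0 means the ordering is fully reversed.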
Unsupervised fuzzy pattern discovery in gene expression data
2010-2011 > Academic research: refereed > Publication in refereed journal
Predicting Genetic Regulatory Response Using Classification
We present a novel classification-based method for learning to predict gene
regulatory response. Our approach is motivated by the hypothesis that in simple
organisms such as Saccharomyces cerevisiae, we can learn a decision rule for
predicting whether a gene is up- or down-regulated in a particular experiment
based on (1) the presence of binding site subsequences ("motifs") in the
gene's regulatory region and (2) the expression levels of regulators such as
transcription factors in the experiment ("parents"). Thus our learning task
integrates two qualitatively different data sources: genome-wide cDNA
microarray data across multiple perturbation and mutant experiments along with
motif profile data from regulatory sequences. We convert the regression task of
predicting real-valued gene expression measurement to a classification task of
predicting +1 and -1 labels, corresponding to up- and down-regulation beyond
the levels of biological and measurement noise in microarray measurements. The
learning algorithm employed is boosting with a margin-based generalization of
decision trees, alternating decision trees. This large-margin classifier is
sufficiently flexible to allow complex logical functions, yet sufficiently
simple to give insight into the combinatorial mechanisms of gene regulation. We
observe encouraging prediction accuracy on experiments based on the Gasch S.
cerevisiae dataset, and we show that we can accurately predict up- and
down-regulation on held-out experiments. Our method thus provides predictive
hypotheses, suggests biological experiments, and provides interpretable insight
into the structure of genetic regulatory networks.
Comment: 8 pages, 4 figures, presented at Twelfth International Conference on
Intelligent Systems for Molecular Biology (ISMB 2004), supplemental website:
http://www.cs.columbia.edu/compbio/geneclas
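The conversion from regression to classification described in the abstract above can be sketched as a thresholding step: measurements beyond a noise level become +1 or -1 labels, and those within the noise band are left out. The threshold value, the use of log expression ratios, and the gene and experiment names are assumptions for illustration; the abstract does not state how the noise level is set.

```python
def to_regulation_label(log_ratio, noise_threshold=1.0):
    """Return +1 (up-regulated), -1 (down-regulated), or None (within noise)."""
    if log_ratio >= noise_threshold:
        return 1
    if log_ratio <= -noise_threshold:
        return -1
    return None


def label_measurements(measurements, noise_threshold=1.0):
    """Keep only (gene, experiment) pairs whose change exceeds the noise level."""
    labels = {}
    for key, log_ratio in measurements.items():
        label = to_regulation_label(log_ratio, noise_threshold)
        if label is not None:
            labels[key] = label
    return labels


# Hypothetical log expression ratios keyed by (gene, experiment).
measurements = {
    ("geneA", "heat_shock"): 2.3,
    ("geneB", "heat_shock"): -1.7,
    ("geneC", "heat_shock"): 0.2,
}
labels = label_measurements(measurements, noise_threshold=1.0)
```

Dropping the within-noise pairs gives the classifier a clean two-class target, at the cost of making no prediction for genes whose expression barely changes.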