A sparse regulatory network of copy-number driven expression reveals putative breast cancer oncogenes
The influence of DNA cis-regulatory elements on a gene's expression has been
intensively studied. However, little is known about expressions driven by
trans-acting DNA hotspots. DNA hotspots harboring copy number aberrations are
recognized to be important in cancer as they influence multiple genes on a
global scale. The challenge in detecting trans-effects is mainly due to the
computational difficulty in detecting weak and sparse trans-acting signals
amidst co-occurring passenger events. We propose an integrative approach to
learn a sparse interaction network of DNA copy-number regions with their
downstream targets in a breast cancer dataset. Information from this network
helps distinguish copy-number driven from copy-number independent expression
changes on a global scale. Our result further delineates cis- and trans-effects
in a breast cancer dataset, for which important oncogenes such as ESR1 and
ERBB2 appear to be highly copy-number dependent. Further, our model is shown to
be efficient and, in terms of goodness of fit, no worse than other state-of-the-art
predictors and network reconstruction models on both simulated and real
data.
Comment: Accepted at IEEE International Conference on Bioinformatics &
Biomedicine (BIBM 2010
Network estimation in State Space Model with L1-regularization constraint
Biological networks have arisen as an attractive paradigm of genomic science
ever since the introduction of large scale genomic technologies which carried
the promise of elucidating the relationship in functional genomics. Microarray
technologies coupled with appropriate mathematical or statistical models have
made it possible to identify dynamic regulatory networks or to measure the time
course of the expression levels of many genes simultaneously. However, one of the
few limitations lies in the high-dimensional nature of such data, coupled with
the fact that these gene expression data are known to include some hidden
process. In that regard, we are concerned with deriving a method for inferring
a sparse dynamic network in a high dimensional data setting. We assume that the
observations are noisy measurements of gene expression in the form of mRNAs,
whose dynamics can be described by some unknown or hidden process. We build an
input-dependent linear state space model from these hidden states and
demonstrate how an incorporated regularization constraint in an
Expectation-Maximization (EM) algorithm can be used to reverse engineer
transcriptional networks from gene expression profiling data. This corresponds
to estimating the model interaction parameters. The proposed method is
illustrated on time-course microarray data from a well-established
T-cell dataset. At the optimum tuning parameters we found genes TRAF5, JUND, CDK4,
CASP4, CD69, and C3X1 to have a higher number of inward-directed connections and
FYB, CCNA2, AKT1 and CASP8 to have a higher number of outward-directed
connections. We recommend these genes as objects for further investigation.
Caspase 4 is also found to activate the expression of JunD, which in turn
represses the cell cycle regulator CDC2.
Comment: arXiv admin note: substantial text overlap with arXiv:1308.359
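For orientation, an input-dependent linear state space model with an L1-penalized EM step can be written generically as below. This is only a sketch under assumptions: the symbols (hidden state x_t, input u_t, observation y_t, matrices A, B, C, D, noise covariances Q, R, tuning parameter \lambda) and the choice of which parameters are penalized are illustrative and are not taken from the abstract.

```latex
% Generic input-dependent linear state space model (notation assumed)
\begin{aligned}
x_{t+1} &= A x_t + B u_t + w_t, & w_t &\sim \mathcal{N}(0, Q), \\
y_t     &= C x_t + D u_t + v_t, & v_t &\sim \mathcal{N}(0, R).
\end{aligned}

% L1-penalized M-step at EM iteration k, with \theta = (A, B, C, D) (assumed)
\theta^{(k+1)} = \arg\max_{\theta}\;
  \mathbb{E}_{x_{1:T} \mid y_{1:T},\, u_{1:T},\, \theta^{(k)}}
  \big[ \log p(x_{1:T}, y_{1:T} \mid u_{1:T}, \theta) \big]
  \;-\; \lambda \, \lVert \theta \rVert_1
```

In this form, nonzero entries of the estimated interaction matrices are read as directed edges of the inferred network, and a larger \lambda yields a sparser network.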
Weighted-Lasso for Structured Network Inference from Time Course Data
We present a weighted-Lasso method to infer the parameters of a first-order
vector auto-regressive model that describes time course expression data
generated by directed gene-to-gene regulation networks. These networks are
assumed to have a prior internal structure of connectivity which drives the
inference method. This prior structure can be either derived from prior
biological knowledge or inferred by the method itself. We illustrate the
performance of this structure-based penalization both on synthetic data and on
two canonical regulatory networks: first, the yeast cell cycle regulation network,
by analyzing Spellman et al.'s dataset, and second, the E. coli S.O.S. DNA repair network,
by analyzing data from U. Alon's lab.
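The idea of a weighted-Lasso fit of a first-order vector auto-regressive model can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the coordinate-descent solver, the function names (`soft_threshold`, `weighted_lasso_row`, `fit_var1`), the toy series and the weight values are all assumptions made for illustration. Each entry `weights[i][j]` scales the penalty on the edge from gene j to gene i, so edges that a prior structure considers unlikely can be given larger weights.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (proximal operator of |.|)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def weighted_lasso_row(X, y, weights, lam, n_iter=200):
    """Minimize (1/2n)*||y - X a||^2 + lam * sum_j weights[j]*|a_j|
    by cyclic coordinate descent. X is a list of rows, y a list of targets."""
    n, p = len(X), len(X[0])
    coef = [0.0] * p
    residual = list(y)  # residual = y - X @ coef, with coef starting at zero
    col_norm = [sum(X[t][j] ** 2 for t in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_norm[j] == 0.0:
                continue  # constant-zero predictor: coefficient stays at zero
            # Correlation of predictor j with the partial residual
            rho = sum(X[t][j] * (residual[t] + coef[j] * X[t][j])
                      for t in range(n)) / n
            new = soft_threshold(rho, lam * weights[j]) / col_norm[j]
            delta = new - coef[j]
            if delta != 0.0:
                for t in range(n):
                    residual[t] -= delta * X[t][j]
                coef[j] = new
    return coef


def fit_var1(series, weights, lam):
    """Estimate A in x_{t+1} = A x_t + noise, one penalized regression per gene.
    series[t][j] is the expression of gene j at time t (hypothetical layout)."""
    X = series[:-1]
    n_genes = len(series[0])
    return [
        weighted_lasso_row(X, [row[i] for row in series[1:]], weights[i], lam)
        for i in range(n_genes)
    ]


# Toy series in which each gene drives the other at the next time step.
series = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
uniform = fit_var1(series, [[1.0, 1.0], [1.0, 1.0]], lam=0.1)
# Hypothetical prior: the edge gene 1 -> gene 0 is considered unlikely.
structured = fit_var1(series, [[1.0, 10.0], [1.0, 1.0]], lam=0.1)
print(uniform)
print(structured)
```

With uniform weights both cross-edges are recovered (shrunk by the penalty), while the heavily weighted edge in the second fit is set exactly to zero. This is the mechanism by which a prior structure steers the inferred network.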
Machine Learning and Integrative Analysis of Biomedical Big Data.
Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) is analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges and exacerbates those associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: curse of dimensionality, data heterogeneity, missing data, class imbalance and scalability issues.
Elephant Search with Deep Learning for Microarray Data Analysis
Despite a plethora of research on microarray gene expression
data analysis, it remains challenging for researchers to analyze the
large and complex expression data effectively and efficiently. Feature
(gene) selection is of paramount importance for understanding the
differences in biological and non-biological variation between samples. To
address this problem, a novel elephant search (ES) based optimization
is proposed to select the best gene expressions from the large volume of microarray
data. Further, a machine learning method is envisioned to leverage
such high-dimensional and complex microarray datasets, extracting hidden
patterns to make meaningful predictions and accurate
classifications. In particular, stochastic gradient descent based deep learning
(DL) with a softmax activation function is used on the reduced features
(genes) to better classify samples according to their gene
expression levels. The experiments are carried out on nine of the most popular cancer
microarray gene selection datasets, obtained from the UCI machine learning
repository. The empirical results obtained by the proposed elephant search
based deep learning (ESDL) approach are compared with the most recently published
article to assess its suitability for future bioinformatics research.
Comment: 12 pages, 5 Tabl
Infinite Latent Feature Selection: A Probabilistic Latent Graph-Based Ranking Approach
Feature selection is playing an increasingly significant role with respect to
many computer vision applications spanning from object recognition to visual
object tracking. However, most of the recent solutions in feature selection are
not robust across different and heterogeneous sets of data. In this paper, we
address this issue by proposing a robust probabilistic latent graph-based feature
selection algorithm that performs the ranking step while considering all the
possible subsets of features, as paths on a graph, bypassing the combinatorial
problem analytically. An appealing characteristic of the approach is that it
aims to discover an abstraction behind low-level sensory data, that is,
relevancy. Relevancy is modelled as a latent variable in a PLSA-inspired
generative process that allows the investigation of the importance of a feature
when injected into an arbitrary set of cues. The proposed method has been
tested on ten diverse benchmarks and compared against eleven state-of-the-art
feature selection methods. Results show that the proposed approach attains the
highest performance levels across many different scenarios and difficulties,
thereby confirming its strong robustness while setting a new state of the art
in the feature selection domain.
Comment: Accepted at the IEEE International Conference on Computer Vision
(ICCV), 2017, Venice. Preprint cop
Kernel methods in genomics and computational biology
Support vector machines and kernel methods are increasingly popular in
genomics and computational biology, due to their good performance in real-world
applications and strong modularity that makes them suitable to a wide range of
problems, from the classification of tumors to the automatic annotation of
proteins. Their ability to work in high dimension, to process non-vectorial
data, and the natural framework they provide to integrate heterogeneous data
are particularly relevant to various problems arising in computational biology.
In this chapter we survey some of the most prominent applications published so
far, highlighting the particular developments in kernel methods triggered by
problems in biology, and mention a few promising research directions likely to
expand in the future.
A Regularized Method for Selecting Nested Groups of Relevant Genes from Microarray Data
Gene expression analysis aims at identifying the genes able to accurately
predict biological parameters like, for example, disease subtyping or
progression. While accurate prediction can be achieved by means of many
different techniques, gene identification, due to gene correlation and the
limited number of available samples, is a much more elusive problem. Small
changes in the expression values often produce different gene lists, and
solutions which are both sparse and stable are difficult to obtain. We propose
a two-stage regularization method able to learn linear models characterized by
a high prediction performance. By varying a suitable parameter these linear
models allow one to trade sparsity for the inclusion of correlated genes and to
produce gene lists which are almost perfectly nested. Experimental results on
synthetic and microarray data confirm the interesting properties of the
proposed method and its potential as a starting point for further biological
investigations.
Comment: 17 pages, 8 Post-script figure
DNA expression microarrays may be the wrong tool to identify biological pathways
DNA microarray expression signatures are expected to provide new insights into pathophysiological pathways. Numerous variant statistical methods have been described for each step of the signal analysis. We employed five similar statistical tests on the same data set at the level of gene selection. Inter-test agreement for the identification of biological pathways in BioCarta, KEGG and Reactome was calculated using Cohen’s κ-score. The identification of specific biological pathways showed only moderate agreement (0.30 < κ < 0.79) between the analysis methods used. Pathways identified by microarrays must be treated cautiously as they vary according to the statistical method used.
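As a reminder of how such an agreement score is computed, here is a minimal sketch of Cohen's kappa for two analysis methods that each flag a pathway as identified (1) or not (0). The function name `cohens_kappa` and the example vectors are hypothetical and are not data from the study.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: (observed agreement - chance agreement) / (1 - chance agreement)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Fraction of items on which the two methods agree
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance from each method's marginal label frequencies
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when both methods use a single label")
    return (observed - expected) / (1.0 - expected)


# Hypothetical calls for eight pathways by two statistical tests.
test_a = [1, 1, 1, 0, 0, 0, 1, 0]
test_b = [1, 1, 0, 0, 0, 1, 1, 0]
print(cohens_kappa(test_a, test_b))  # 0.5
```

Here the two tests agree on 6 of 8 pathways (0.75), but half of that agreement is expected by chance given the marginals, so kappa is 0.5. The correction for chance is why kappa is preferred over raw percent agreement for this kind of comparison.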