A hidden spatial-temporal Markov random field model for network-based analysis of time course gene expression data
Microarray time course (MTC) gene expression data are commonly collected to
study the dynamic nature of biological processes. One important problem is to
identify genes that show different expression profiles over time and pathways
that are perturbed during a given biological process. While methods are
available to identify the genes with differential expression levels over time,
there is a lack of methods that can incorporate the pathway information in
identifying the pathways being modified/activated during a biological process.
In this paper we develop a hidden spatial-temporal Markov random field
(hstMRF)-based method for identifying genes and subnetworks that are related to
biological processes, where the dependency of the differential expression
patterns of genes on the networks is modeled over time and over the network of
pathways. Simulation studies indicated that the method is quite effective in
identifying genes and modified subnetworks and has higher sensitivity than the
commonly used procedures that do not use the pathway structure or time
dependency information, with similar false discovery rates. Application to a
microarray gene expression study of systemic inflammation in humans identified
a core set of genes on the KEGG pathways that show clear differential
expression patterns over time. In addition, the method confirmed that the
TOLL-like signaling pathway plays an important role in immune response to
endotoxins.
Comment: Published in the Annals of Applied Statistics
(http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics
(http://www.imstat.org) at http://dx.doi.org/10.1214/07-AOAS145
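The network-smoothing idea behind the abstract can be illustrated with a toy Markov random field. The sketch below is not the authors' hstMRF (which is spatial-temporal and fitted to real pathway data); it is a minimal Ising-style model in which a gene's differential-expression label is pulled both by its own evidence and by its neighbors' labels, fitted by iterated conditional modes. The scores, threshold, and smoothing strength `beta` are all assumed values.

```python
import numpy as np

def icm_network_labels(scores, adjacency, beta=0.5, threshold=1.0, n_iter=10):
    """Label genes 0/1 ("unchanged"/"differentially expressed") on a network.

    Toy Ising-style MRF fitted by iterated conditional modes. `adjacency` is a
    symmetric 0/1 matrix with zero diagonal; `beta` is the strength of the
    network smoothness prior. All parameter values are illustrative.
    """
    x = (scores > threshold).astype(int)
    deg = adjacency.sum(axis=1)
    for _ in range(n_iter):
        changed = False
        for i in range(len(scores)):
            on = adjacency[i] @ x                      # neighbors labeled 1
            e0 = -beta * (deg[i] - on)                 # local energy if x[i] = 0
            e1 = -(scores[i] - threshold) - beta * on  # local energy if x[i] = 1
            new = int(e1 < e0)
            changed |= new != x[i]
            x[i] = new
        if not changed:
            break
    return x
```

On a four-gene chain with scores (3, 3, 3, 0.9) and threshold 1, the borderline fourth gene is pulled to label 1 by its neighbor when beta = 0.5, but stays at 0 when the network prior is switched off (beta = 0).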
Nonparametric false discovery rate control for identifying simultaneous signals
It is frequently of interest to jointly analyze multiple sequences of
multiple tests in order to identify simultaneous signals, defined as features
tested in multiple studies whose test statistics are non-null in each. In many
problems, however, the null distributions of the test statistics may be
complicated or even unknown, and there do not currently exist any procedures
that can be employed in these cases. This paper proposes a new nonparametric
procedure that can identify simultaneous signals across multiple studies even
without knowing the null distributions of the test statistics. The method is
shown to asymptotically control the false discovery rate, and in simulations
had excellent power and error control. In an analysis of gene expression and
histone acetylation patterns in the brains of mice exposed to a conspecific
intruder, it identified genes that were both differentially expressed and next
to differentially accessible chromatin. The proposed method is available in an
R package at github.com/sdzhao/ssa.
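For contrast with the nonparametric procedure, the classical baseline for simultaneous signals when p-values (and hence null distributions) are available combines each feature's two p-values by their maximum and applies Benjamini-Hochberg to the combined values. The sketch below shows that baseline, not the paper's method; the p-values are assumed toy inputs.

```python
import numpy as np

def bh_reject(pvals, alpha=0.1):
    """Benjamini-Hochberg step-up procedure; returns a boolean rejection mask."""
    p = np.asarray(pvals)
    m = len(p)
    order = np.argsort(p)
    below = p[order] <= alpha * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])  # largest rank under the step-up line
        reject[order[:k + 1]] = True
    return reject

def simultaneous_signals_maxp(p1, p2, alpha=0.1):
    """Call feature j a simultaneous signal if BH rejects max(p1[j], p2[j])."""
    return bh_reject(np.maximum(p1, p2), alpha=alpha)
```

Taking the maximum is what makes the call "simultaneous": a feature is rejected only when the evidence is strong in both studies at once.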
A Taxonomy of Big Data for Optimal Predictive Machine Learning and Data Mining
Big data comes in various ways, types, shapes, forms and sizes. Indeed,
almost all areas of science, technology, medicine, public health, economics,
business, linguistics and social science are bombarded by ever increasing flows
of data begging to be analyzed efficiently and effectively. In this paper, we
propose a rough idea of a possible taxonomy of big data, along with some of the
most commonly used tools for handling each particular category of bigness. The
dimensionality p of the input space and the sample size n are usually the main
ingredients in the characterization of data bigness. The specific statistical
machine learning technique used to handle a particular big data set will depend
on which category it falls in within the bigness taxonomy. Large p small n data
sets for instance require a different set of tools from the large n small p
variety. Among other tools, we discuss Preprocessing, Standardization,
Imputation, Projection, Regularization, Penalization, Compression, Reduction,
Selection, Kernelization, Hybridization, Parallelization, Aggregation,
Randomization, Replication, Sequentialization. Indeed, it is important to
emphasize right away that the so-called no free lunch theorem applies here, in
the sense that there is no universally superior method that outperforms all
other methods on all categories of bigness. It is also important to stress
that simplicity, in the sense of Ockham's razor non-plurality principle of
parsimony, tends to reign supreme when it comes to massive data. We conclude
with a comparison of the predictive performance of some of the most commonly
used methods on a few data sets.
Comment: 18 pages, 2 figures, 3 tables
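The n-versus-p characterization of bigness described above can be sketched as a simple dispatch. The cutoffs and tool groupings below are illustrative assumptions, not values taken from the paper.

```python
def bigness_category(n, p):
    """Crude taxonomy dispatch on sample size n and dimensionality p.

    The cutoffs here are illustrative assumptions only.
    """
    if p > n:
        return "large p, small n: regularization, penalization, selection, projection"
    if n > 1_000_000:
        return "large n: parallelization, aggregation, sequentialization"
    return "moderate n and p: standard statistical machine learning toolbox"
```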
Listen to genes: dealing with microarray data in the frequency domain
Background: We present a novel and systematic approach to analyze temporal microarray data. The approach includes
normalization, clustering and network analysis of genes.
Methodology: Genes are normalized using an error-model-based uniform normalization method aimed at identifying and
estimating the sources of variations. The model minimizes the correlation among error terms across replicates. The
normalized gene expressions are then clustered in terms of their power spectrum density. The method of complex Granger
causality is introduced to reveal interactions between sets of genes. Complex Granger causality along with partial Granger
causality is applied in both time and frequency domains to selected as well as all the genes to reveal the interesting
networks of interactions. The approach is successfully applied to Arabidopsis leaf microarray data generated from 31,000
genes observed at 22 time points over 22 days. Three circuits: a circadian gene circuit, an ethylene circuit and a new
global circuit showing a hierarchical structure to determine the initiators of leaf senescence are analyzed in detail.
Conclusions: We use a totally data-driven approach to form biological hypotheses. Clustering using the power-spectrum
analysis helps us identify genes of potential interest. Their dynamics can be captured accurately in the time and frequency
domain using the methods of complex and partial Granger causality. With the rise in availability of temporal microarray
data, such methods can be useful tools for uncovering hidden biological
interactions. We present our method step by step with the help of toy models
as well as a real biological dataset. We also analyse three distinct gene
circuits of potential interest to Arabidopsis researchers.
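The Granger idea underlying the abstract can be shown in its simplest time-domain, pairwise form: gene y "Granger-causes" gene x if adding y's past to an autoregression on x's own past reduces the residual variance. This is a toy check, not the complex or partial Granger causality of the paper, and the series below are simulated assumptions.

```python
import numpy as np

def lagged_design(series_list, lag):
    """Stack lagged columns; rows correspond to time t = lag .. T-1."""
    T = len(series_list[0])
    cols = [s[lag - k:T - k] for s in series_list for k in range(1, lag + 1)]
    return np.column_stack(cols)

def granger_strength(x, y, lag=3):
    """Log ratio of residual sums of squares: AR on x's own past (restricted)
    versus AR on the past of both x and y (full). Larger means y -> x."""
    target = x[lag:]
    def rss(X):
        X1 = np.column_stack([np.ones(len(X)), X])  # intercept column
        beta, *_ = np.linalg.lstsq(X1, target, rcond=None)
        r = target - X1 @ beta
        return float(r @ r)
    return np.log(rss(lagged_design([x], lag)) / rss(lagged_design([x, y], lag)))
```

With x built as a lagged, noisy copy of y, the strength of y -> x is large while the reverse is near zero, as expected.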
Random Forests: some methodological insights
This paper examines from an experimental perspective random forests, the
increasingly used statistical method for classification and regression problems
introduced by Leo Breiman in 2001. It first aims at confirming known but
sparse advice for using random forests and at proposing some complementary
remarks for both standard problems as well as high dimensional ones for which
the number of variables hugely exceeds the sample size. But the main
contribution of this paper is twofold: to provide some insights about the
behavior of the variable importance index based on random forests and in
addition, to propose to investigate two classical issues of variable selection.
The first one is to find important variables for interpretation and the second
one is more restrictive and tries to design a good prediction model. The
strategy involves ranking the explanatory variables using the random forests
score of importance, followed by stepwise ascending variable introduction.
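The two-step strategy (importance ranking, then stepwise ascending introduction) can be sketched with scikit-learn. The data set and hyperparameters below are illustrative assumptions, and cross-validated accuracy with a stop-at-first-non-improvement rule stands in for the paper's exact error estimate and stopping criterion.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical data set: 5 informative variables hidden among 20.
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           n_redundant=0, random_state=0)

# Step 1: rank variables by random-forest importance.
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
ranking = np.argsort(rf.feature_importances_)[::-1]

# Step 2: ascending introduction -- add variables in ranked order for as long
# as the cross-validated score keeps improving (a simplified stopping rule).
best_score, selected = -np.inf, []
for j in ranking:
    trial = selected + [int(j)]
    score = cross_val_score(
        RandomForestClassifier(n_estimators=100, random_state=0),
        X[:, trial], y, cv=3).mean()
    if score > best_score:
        best_score, selected = score, trial
    else:
        break
```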
Biomarker discovery and redundancy reduction towards classification using a multi-factorial MALDI-TOF MS T2DM mouse model dataset
Diabetes, like many diseases and biological processes, is not mono-causal. On the one hand, multifactorial studies with complex experimental designs are required for its comprehensive analysis. On the other hand, the data from these studies often include a substantial amount of redundancy, such as proteins that are typically represented by a multitude of peptides. Coping simultaneously with both complexities (experimental and technological) makes data analysis a challenge for bioinformatics.