Correlation-Compressed Direct Coupling Analysis
Learning Ising or Potts models from data has become an important topic in
statistical physics and computational biology, with applications to predictions
of structural contacts in proteins and other areas of biological data analysis.
The corresponding inference problems are challenging since the normalization
constant (partition function) of the Ising/Potts distributions cannot be
computed efficiently on large instances. Different ways to address this issue
have hence given rise to a substantial methodological literature. In this paper
we investigate how these methods could be used on much larger datasets than
studied previously. We focus on a central aspect, that in practice these
inference problems are almost always severely under-sampled, and the
operational result is almost always a small set of leading (largest)
predictions. We therefore explore an approach where the data is pre-filtered
based on empirical correlations, which can be computed directly even for very
large problems. Inference is only used on the much smaller instance in a
subsequent step of the analysis. We show that in several relevant model classes
such a combined approach gives results of almost the same quality as the
computationally much more demanding inference on the whole dataset. We also
show that results on whole-genome epistatic couplings that were obtained in a
recent computation-intensive study can be retrieved by the new approach. The
method of this paper hence opens up the possibility to learn parameters
describing pair-wise dependencies in whole genomes in a computationally
feasible and expedient manner.
Comment: 15 pages, including 11 figures
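The two-stage idea described in this abstract (filter by empirical correlations, then run inference only on the reduced instance) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the `n_keep` and `ridge` parameters, the synthetic data, and the use of naive mean-field inversion as the inference step are all assumptions made for the example.

```python
import numpy as np


def prefilter_by_correlation(samples, n_keep):
    """Pick variables that take part in the strongest empirical correlations.

    samples: (n_samples, n_variables) array of +/-1 spins.
    n_keep:  hypothetical budget for the size of the reduced instance.
    """
    # Constant columns give NaN correlations; treat those as zero.
    corr = np.nan_to_num(np.corrcoef(samples, rowvar=False))
    strength = np.abs(corr)
    rows, cols = np.triu_indices(strength.shape[0], k=1)
    order = np.argsort(strength[rows, cols])[::-1]  # strongest pairs first
    keep = []
    for idx in order:
        for v in (int(rows[idx]), int(cols[idx])):
            if v not in keep and len(keep) < n_keep:
                keep.append(v)
        if len(keep) >= n_keep:
            break
    return sorted(keep)


def mean_field_couplings(samples, ridge=0.1):
    """Naive mean-field estimate J = -C^{-1}, used here as a stand-in
    for whichever inference method is applied to the reduced instance."""
    cov = np.cov(samples, rowvar=False)
    cov = cov + ridge * np.eye(cov.shape[0])  # regularize: data are under-sampled
    couplings = -np.linalg.inv(cov)
    np.fill_diagonal(couplings, 0.0)
    return couplings


def run_demo(seed=0):
    """Synthetic check: variables 3 and 7 are coupled, the rest are noise."""
    rng = np.random.default_rng(seed)
    samples = rng.choice([-1, 1], size=(200, 50))
    flips = rng.random(200) < 0.1
    samples[:, 7] = np.where(flips, -samples[:, 3], samples[:, 3])

    keep = prefilter_by_correlation(samples, n_keep=10)
    couplings = mean_field_couplings(samples[:, keep])
    a, b = np.unravel_index(np.argmax(np.abs(couplings)), couplings.shape)
    top_pair = tuple(sorted((keep[a], keep[b])))
    return keep, top_pair
```

Calling `run_demo()` returns the retained variable indices and the pair with the largest inferred coupling. Only the correlation matrix is computed on all 50 variables; the matrix inversion runs on the 10 that survive the filter.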
Network estimation in State Space Model with L1-regularization constraint
Biological networks have arisen as an attractive paradigm of genomic science
ever since the introduction of large-scale genomic technologies, which carried
the promise of elucidating relationships in functional genomics. Microarray
technologies coupled with appropriate mathematical or statistical models have
made it possible to identify dynamic regulatory networks or to measure the time
course of the expression levels of many genes simultaneously. However, one of the
few limitations lies in the high-dimensional nature of such data, coupled with
the fact that these gene expression data are known to include some hidden
process. In that regard, we are concerned with deriving a method for inferring
a sparse dynamic network in a high-dimensional data setting. We assume that the
observations are noisy measurements of gene expression in the form of mRNAs,
whose dynamics can be described by some unknown or hidden process. We build an
input-dependent linear state space model from these hidden states and
demonstrate how an incorporated regularization constraint in an
Expectation-Maximization (EM) algorithm can be used to reverse engineer
transcriptional networks from gene expression profiling data. This corresponds
to estimating the model interaction parameters. The proposed method is
illustrated on time-course microarray data obtained from a well-established
T-cell data set. At the optimum tuning parameters we found genes TRAF5, JUND, CDK4,
CASP4, CD69, and C3X1 to have a higher number of inward-directed connections and
FYB, CCNA2, AKT1 and CASP8 to be genes with a higher number of outward-directed
connections. We recommend these genes as objects for further investigation.
Caspase 4 is also found to activate the expression of JunD which in turn
represses the cell cycle regulator CDC2.
Comment: arXiv admin note: substantial text overlap with arXiv:1308.359
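For orientation, a generic input-dependent linear state space model with an L1-penalized M-step can be written as below. This is a textbook-style sketch rather than the paper's exact formulation: the symbols $A, B, C, D, Q, R$, the input $u_t$, the penalty weight $\lambda$, and the choice of which parameters $\Theta$ are penalized are all assumptions made for illustration.

```latex
\begin{align}
x_t &= A x_{t-1} + B u_t + w_t, & w_t &\sim \mathcal{N}(0, Q), \\
y_t &= C x_t + D u_t + v_t,     & v_t &\sim \mathcal{N}(0, R), \\
\Theta^{(k+1)} &= \arg\max_{\Theta}\;
  \mathbb{E}_{x_{1:T} \mid y_{1:T},\, \Theta^{(k)}}
  \big[ \log p(x_{1:T}, y_{1:T} \mid u_{1:T}; \Theta) \big]
  - \lambda \lVert \Theta \rVert_1 .
\end{align}
```

Here $x_t$ is the hidden state, $y_t$ the noisy expression measurement, and the L1 term drives many entries of the interaction matrices to zero, which is what yields a sparse network.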
Determination of Interaction Potentials of Amino Acids from Native Protein Structures: Test on Simple Lattice Models
We propose a novel method for the determination of the effective interaction
potential between the amino acids of a protein. The strategy is based on the
combination of a new optimization procedure and a geometrical argument, which
also uncovers the shortcomings of any optimization procedure. The strategy can
be applied to any data set of native structures, such as those available from
the Protein Data Bank (PDB). In this work, however, we explain and test our
approach on simple lattice models, where the true interactions are known a
priori. Excellent agreement is obtained between the extracted and the true
potentials even for modest numbers of protein structures in the PDB.
Comparisons with other methods are also discussed.
Comment: 24 pages, 4 figures