Hidden Markov models Incorporating fuzzy measures and integrals for protein sequence identification and alignment
Profile hidden Markov models (HMMs), which are based on classical HMMs, have been widely applied to protein sequence identification. The forward and backward variables in profile HMMs are formulated under the statistical independence assumption of probability theory. We propose a fuzzy profile HMM to overcome the limitations of that assumption and to achieve an improved alignment for protein sequences belonging to a given family. The proposed model fuzzifies the forward and backward variables by incorporating Sugeno fuzzy measures and Choquet integrals, thereby further extending the generalized HMM. Based on the fuzzified forward and backward variables, we propose a fuzzy Baum-Welch parameter estimation algorithm for profiles. The strong correlations and the sequence preference involved in protein structures make this fuzzy-architecture-based model a suitable candidate for building profiles of a given family, since fuzzy sets can handle uncertainties better than classical methods.
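For orientation, the classical forward variable that the abstract proposes to fuzzify can be sketched as follows. This is a minimal illustration for a plain discrete HMM, not a profile HMM with match, insert, and delete states, and it is not the fuzzy model itself. The function name `forward` and the toy parameters are assumptions made for the example.

```python
def forward(init, trans, emit, obs):
    """Classical HMM forward algorithm.

    Returns the table alpha, where alpha[t][j] = P(o_1..o_t, state_t = j),
    and the total likelihood P(o_1..o_T).
    """
    n_states = len(init)
    # Initialisation: start in state j and emit the first symbol.
    alpha = [[init[j] * emit[j][obs[0]] for j in range(n_states)]]
    for symbol in obs[1:]:
        prev = alpha[-1]
        # Additive sum over predecessor states. This is the probabilistic
        # step that a fuzzy variant would replace with a fuzzy integral.
        alpha.append([
            sum(prev[i] * trans[i][j] for i in range(n_states)) * emit[j][symbol]
            for j in range(n_states)
        ])
    return alpha, sum(alpha[-1])


# Toy two-state model over a two-symbol alphabet (hypothetical numbers).
init = [0.6, 0.4]
trans = [[0.7, 0.3], [0.4, 0.6]]
emit = [[0.9, 0.1], [0.2, 0.8]]
alpha, likelihood = forward(init, trans, emit, [0, 1, 0])
```

The backward variable is computed analogously in the reverse direction, and Baum-Welch re-estimation combines the two.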
Laplacian Mixture Modeling for Network Analysis and Unsupervised Learning on Graphs
Laplacian mixture models identify overlapping regions of influence in
unlabeled graph and network data in a scalable and computationally efficient
way, yielding useful low-dimensional representations. By combining Laplacian
eigenspace and finite mixture modeling methods, they provide probabilistic or
fuzzy dimensionality reductions or domain decompositions for a variety of input
data types, including mixture distributions, feature vectors, and graphs or
networks. Provable optimal recovery using the algorithm is analytically shown
for a nontrivial class of cluster graphs. Heuristic approximations for scalable
high-performance implementations are described and empirically tested.
Connections to PageRank and community detection in network analysis demonstrate
the wide applicability of this approach. The origins of fuzzy spectral methods,
beginning with generalized heat or diffusion equations in physics, are reviewed
and summarized. Comparisons to other dimensionality reduction and clustering
methods for challenging unsupervised machine learning problems are also
discussed. Comment: 13 figures, 35 references
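The Laplacian eigenspace that this abstract builds on can be illustrated with a small example. The sketch below computes only the unnormalized graph Laplacian and its second eigenvector for a toy graph; it does not implement the Laplacian mixture model, and the graph and variable names are assumptions chosen for illustration.

```python
import numpy as np

# Toy graph (hypothetical): two triangles, {0, 1, 2} and {3, 4, 5},
# joined by the single bridge edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
n_nodes = 6

adjacency = np.zeros((n_nodes, n_nodes))
for i, j in edges:
    adjacency[i, j] = adjacency[j, i] = 1.0

# Unnormalized graph Laplacian: L = D - A, with D the diagonal degree matrix.
laplacian = np.diag(adjacency.sum(axis=1)) - adjacency

# eigh returns eigenvalues in ascending order for symmetric matrices.
eigenvalues, eigenvectors = np.linalg.eigh(laplacian)

# The eigenvector of the second-smallest eigenvalue (Fiedler vector) gives a
# one-dimensional embedding; its sign separates the two triangles.
fiedler = eigenvectors[:, 1]
labels = (fiedler > 0).astype(int)
```

A mixture model fitted on such low-dimensional coordinates would yield soft rather than hard memberships; that step is omitted here.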
Using context to improve protein domain identification
Background: Identifying domains in protein sequences is an important step in protein structural and functional annotation. Existing domain recognition methods typically evaluate each domain prediction independently of the rest. However, the majority of proteins are multidomain, and pairwise domain co-occurrences are highly specific and non-transitive.
Results: Here, we demonstrate how to exploit domain co-occurrence to boost weak domain predictions that appear in previously observed combinations, while penalizing higher confidence domains if such combinations have never been observed. Our framework, Domain Prediction Using Context (dPUC), incorporates pairwise "context" scores between domains, along with traditional domain scores and thresholds, and improves domain prediction across a variety of organisms from bacteria to protozoa and metazoa. Among the genomes we tested, dPUC is most successful at improving predictions for the poorly-annotated malaria parasite Plasmodium falciparum, for which over 38% of the genome is currently unannotated. Our approach enables high-confidence annotations in this organism and the identification of orthologs to many core machinery proteins conserved in all eukaryotes, including those involved in ribosomal assembly and other RNA processing events, which surprisingly had not been previously known.
Conclusions: Overall, our results demonstrate that this new context-based approach will provide significant improvements in domain and function prediction, especially for poorly understood genomes for which the need for additional annotations is greatest. Source code for the algorithm is available under a GPL open source license at http://compbio.cs.princeton.edu/dpuc/. Pre-computed results for our test organisms and a web server are also available at that location.
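The idea of combining per-domain scores with pairwise context scores can be illustrated with a toy example. The sketch below is not the dPUC algorithm or its scoring scheme. It assumes a simple additive objective, ignores overlaps between domains, and uses exhaustive search, and every name and number in it is hypothetical.

```python
from itertools import combinations


def best_domain_set(domain_scores, context_scores, unseen_penalty):
    """Pick the subset of candidate domains with the highest total score.

    Total score = sum of individual domain scores
                + sum of pairwise context scores over all pairs in the subset.
    Pairs missing from context_scores receive unseen_penalty.
    Keys of context_scores must be alphabetically sorted name pairs.
    """
    names = sorted(domain_scores)
    best_set, best_total = (), 0.0  # the empty prediction scores zero
    for size in range(1, len(names) + 1):
        for subset in combinations(names, size):
            total = sum(domain_scores[d] for d in subset)
            for pair in combinations(subset, 2):
                total += context_scores.get(pair, unseen_penalty)
            if total > best_total:
                best_set, best_total = subset, total
    return best_set, best_total


# Hypothetical scores relative to a threshold: B is a weak prediction,
# C is a confident one, and only the pair (A, B) has been observed before.
domain_scores = {"A": 5.0, "B": -1.0, "C": 3.0}
context_scores = {("A", "B"): 4.0}
chosen, score = best_domain_set(domain_scores, context_scores, unseen_penalty=-6.0)
```

In this toy case the weak domain B is kept because it co-occurs with A in a previously observed combination, while the more confident C is dropped because its combinations were never observed. Exhaustive search is only feasible for a handful of candidates.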
A survey of statistical network models
Networks are ubiquitous in science and have become a focal point for
discussion in everyday life. Formal statistical models for the analysis of
network data have emerged as a major topic of interest in diverse areas of
study, and most of these involve a form of graphical representation.
Probability models on graphs date back to 1959. Along with empirical studies in
social psychology and sociology from the 1960s, these early works generated an
active network community and a substantial literature in the 1970s. This effort
moved into the statistical literature in the late 1970s and 1980s, and the past
decade has seen a burgeoning network literature in statistical physics and
computer science. The growth of the World Wide Web and the emergence of online
networking communities such as Facebook, MySpace, and LinkedIn, and a host of
more specialized professional network communities has intensified interest in
the study of networks and network data. Our goal in this review is to provide
the reader with an entry point to this burgeoning literature. We begin with an
overview of the historical development of statistical network modeling and then
we introduce a number of examples that have been studied in the network
literature. Our subsequent discussion focuses on a number of prominent static
and dynamic network models and their interconnections. We emphasize formal
model descriptions, and pay special attention to the interpretation of
parameters and their estimation. We end with a description of some open
problems and challenges for machine learning and statistics. Comment: 96 pages, 14 figures, 333 references
CLADAG 2021 BOOK OF ABSTRACTS AND SHORT PAPERS
The book collects the short papers presented at the 13th Scientific Meeting of the Classification and Data Analysis Group (CLADAG) of the Italian Statistical Society (SIS). The meeting has been organized by the Department of Statistics, Computer Science and Applications of the University of Florence, under the auspices of the Italian Statistical Society and the International Federation of Classification Societies (IFCS). CLADAG is a member of the IFCS, a federation of national, regional, and linguistically-based classification societies. It is a non-profit, non-political scientific organization, whose aims are to further classification research
Reconstructing regulatory networks from high-throughput post-genomic data using MCMC methods
Modern biological research aims to understand when genes are expressed and
how certain genes influence the expression of other genes. Gene regulatory
networks are used to organize and visualize gene expression activity. The
architecture of these networks is of great importance, as it enables us to
identify inconsistencies between hypotheses and observations, and to predict
the behavior of biological processes in yet untested conditions.
Data from gene expression measurements are used to construct gene regulatory
networks. Along with the advance of high-throughput technologies for measuring
gene expression, statistical methods for predicting regulatory networks have
also been evolving. This thesis presents a computational framework based on a
Bayesian modeling technique using state space models (SSM) for the inference of
gene regulatory networks from time-series measurements.
A linear SSM consists of observation and hidden state equations. The hidden
variables can unfold effects that cannot be directly measured in an experiment, such
as missing gene expression. We have used a Bayesian MCMC approach based on
Gibbs sampling for the inference of parameters. However, the task of determining
the dimension of the hidden state space variables remains crucial for the accuracy
of network inference. For this we have used the Bayesian evidence (or marginal
likelihood) as a yardstick. In addition, the Bayesian approach also provides the
possibility of incorporating prior information, based on literature knowledge.
We compare marginal likelihoods calculated from the Gibbs sampler output
to the lower bound calculated by a variational approximation. Before using the
algorithm for the analysis of real biological experimental datasets, we perform
validation tests using numerical experiments based on simulated time series
datasets generated by in-silico networks. The robustness of our algorithm can be
measured by its ability to recover the input data and the generating networks
using the inferred parameters.
Our developed algorithm, GBSSM, was used to infer a gene network using
E. coli data sets from the different stress conditions of temperature shift and acid
stress. The resulting model for the gene expression response under temperature shift
captures the effects of global transcription factors, such as fnr, which control the
regulation of hundreds of other genes. Interestingly, we also observe the
stress-inducible membrane protein OsmC regulating transcriptional activity involved
in the adaptation mechanism under both temperature shift and acid stress conditions.
In the case of acid stress, integration of metabolomic and transcriptome data suggests
that the observed rapid decrease in the concentration of glycine betaine is the result
of the activation of osmoregulators, which may play a key role in acid stress adaptation.
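The linear state space model mentioned in this abstract, a hidden state equation plus an observation equation, can be illustrated with a short simulation. The sketch below assumes a generic form with Gaussian noise and no external inputs, which may differ from the model used in the thesis. It covers only data generation, not Gibbs sampling or evidence estimation, and the function name and all matrices are hypothetical.

```python
import numpy as np


def simulate_linear_ssm(A, C, x0, n_steps, state_sd, obs_sd, rng):
    """Simulate a linear Gaussian state space model.

    Hidden state:  x_t = A x_{t-1} + w_t,  w_t ~ N(0, state_sd^2 I)
    Observation:   y_t = C x_t + v_t,      v_t ~ N(0, obs_sd^2 I)
    Returns arrays of shape (n_steps, state_dim) and (n_steps, obs_dim).
    """
    A = np.asarray(A, dtype=float)
    C = np.asarray(C, dtype=float)
    x = np.asarray(x0, dtype=float)
    states, observations = [], []
    for _ in range(n_steps):
        x = A @ x + rng.normal(0.0, state_sd, size=x.shape)
        y = C @ x + rng.normal(0.0, obs_sd, size=C.shape[0])
        states.append(x)
        observations.append(y)
    return np.array(states), np.array(observations)


# Hypothetical example: two hidden factors driving three observed genes.
rng = np.random.default_rng(0)
A = [[0.9, 0.1], [0.0, 0.8]]
C = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
states, observations = simulate_linear_ssm(A, C, [1.0, -1.0], 20, 0.05, 0.1, rng)
```

Simulated series of this kind correspond to the in-silico validation data described in the abstract, where the generating parameters are known and can be compared with the inferred ones.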
Integrate qualitative biological knowledge for gene regulatory network reconstruction with dynamic Bayesian networks
Reconstructing gene regulatory networks, especially the dynamic gene networks that reveal the temporal program of gene expression from microarray expression data, is essential in systems biology. To overcome the challenges posed by noisy and under-sampled microarray data, developing data fusion methods that integrate legacy biological knowledge into gene network reconstruction is a promising direction. However, the large amount of qualitative biological knowledge accumulated by previous research, albeit very valuable, has received less attention for reconstructing dynamic gene networks because of its incompatibility with quantitative computational models. In this dissertation, I introduce a novel method to fuse qualitative gene interaction information with quantitative microarray data under the Dynamic Bayesian Networks framework. This method extends previous data integration methods in two ways: it uses Bayesian Networks to exploit qualitative biological knowledge without the involvement of human experts, and it takes time-series data to produce dynamic gene networks. The experimental study shows that, compared with the standard Dynamic Bayesian Networks method, which uses only microarray data, our method excels in both accuracy and consistency.