Markov Bases for Typical Block Effect Models of Two-way Contingency Tables
A Markov basis for a statistical model of contingency tables is a useful tool
for performing conditional tests of the model via the Markov chain Monte Carlo
method. In this paper we derive explicit forms of Markov bases for change point
models and block diagonal effect models, which are typical block-wise effect
models of two-way contingency tables, and perform conditional tests with some
real data sets. Comment: 16 pages
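To make the idea of a conditional test via Markov chain Monte Carlo concrete, here is a minimal sketch for the simplest case only: the independence model of a two-way table, whose Markov basis is the set of basic moves (+1, -1, -1, +1 on a 2x2 minor). It does not implement the change point or block diagonal effect models derived in the paper. The function names (`chi_square`, `basic_move_step`, `mcmc_p_value`), the choice of the Pearson chi-square as test statistic, and the default step counts are illustrative assumptions.

```python
import random


def chi_square(table):
    """Pearson chi-square statistic for independence (assumes positive margins)."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    n = sum(rows)
    stat = 0.0
    for i, r in enumerate(rows):
        for j, c in enumerate(cols):
            expected = r * c / n
            stat += (table[i][j] - expected) ** 2 / expected
    return stat


def basic_move_step(table, rng):
    """One Metropolis step with a random basic move; modifies table in place.

    The move adds 1 at (i1, j1) and (i2, j2) and subtracts 1 at (i1, j2)
    and (i2, j1), so all row and column sums are preserved.
    Assumes the table has at least two rows and two columns.
    """
    i1, i2 = rng.sample(range(len(table)), 2)
    j1, j2 = rng.sample(range(len(table[0])), 2)
    if table[i1][j2] == 0 or table[i2][j1] == 0:
        return  # move would create a negative cell: stay at the current table
    # Target is the hypergeometric distribution, proportional to 1 / prod(x_ij!),
    # so the acceptance ratio reduces to a ratio of the four affected cells.
    ratio = (table[i1][j2] * table[i2][j1]) / (
        (table[i1][j1] + 1) * (table[i2][j2] + 1)
    )
    if rng.random() < ratio:
        table[i1][j1] += 1
        table[i2][j2] += 1
        table[i1][j2] -= 1
        table[i2][j1] -= 1


def mcmc_p_value(observed, n_steps=20000, burn_in=2000, seed=0):
    """Estimate the conditional p-value of the chi-square test by MCMC."""
    rng = random.Random(seed)
    table = [list(r) for r in observed]
    threshold = chi_square(observed) - 1e-9  # tolerance for float ties
    hits = 0
    for step in range(burn_in + n_steps):
        basic_move_step(table, rng)
        if step >= burn_in and chi_square(table) >= threshold:
            hits += 1
    return hits / n_steps
```

A call such as `mcmc_p_value([[10, 2], [3, 9]])` returns the fraction of sampled tables, with the same margins as the observed one, whose statistic is at least as large as the observed statistic. For other models the basic moves are in general not sufficient to connect the sample space, which is why explicit Markov bases for each model are needed.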
Markov chain Monte Carlo tests for designed experiments
We consider conditional exact tests of factor effects in designed experiments
for discrete response variables. Similarly to the analysis of contingency
tables, a Markov chain Monte Carlo method can be used for performing exact
tests, when large-sample approximations are poor and the enumeration of the
conditional sample space is infeasible. For designed experiments with a single
observation for each run, we formulate log-linear or logistic models and
consider a connected Markov chain over an appropriate sample space. In
particular, we investigate fractional factorial designs with runs,
noting correspondences to the models for contingency tables.
Goodness of fit for log-linear ERGMs
Many popular models from the networks literature can be viewed through a
common lens of contingency tables on network dyads, resulting in
\emph{log-linear ERGMs}: exponential family models for random graphs whose
sufficient statistics are linear on the dyads. We propose a new model in this
family, the \emph{-SBM}, which combines node and group effects common in
network formation mechanisms. In particular, it is a generalization of several
well-known ERGMs including the stochastic blockmodel for undirected graphs, the
degree-corrected version of it, and the directed model without group
structure.
We frame the problem of testing model fit for the log-linear ERGM class
through an exact conditional test whose $p$-value can be approximated
efficiently in networks of both small and moderately large sizes. The sampling
methods we build rely on a dynamic adaptation of Markov bases. We use quick
estimation algorithms adapted from the contingency table literature and
effective sampling methods rooted in graph theory and algebraic statistics. The
performance and scalability of the method are demonstrated on two data sets from
biology: the connectome of \emph{C. elegans} and the interactome of
\emph{Arabidopsis thaliana}. These two networks -- a neuronal network and a
protein-protein interaction network -- have been popular examples in the
network science literature. Our work provides a model-based approach to
studying them.
Sequences of regressions and their independences
Ordered sequences of univariate or multivariate regressions provide
statistical models for analysing data from randomized, possibly sequential
interventions, from cohort or multi-wave panel studies, but also from
cross-sectional or retrospective studies. Conditional independences are
captured by what we name regression graphs, provided the generated distribution
shares some properties with a joint Gaussian distribution. Regression graphs
extend purely directed, acyclic graphs by two types of undirected graph, one
type for components of joint responses and the other for components of the
context vector variable. We review the special features and the history of
regression graphs, derive criteria to read all implied independences of a
regression graph and prove criteria for Markov equivalence, that is, to judge
whether two different graphs imply the same set of independence statements.
Knowledge of Markov equivalence provides alternative interpretations of a given
sequence of regressions, is essential for machine learning strategies and
permits the use of the simple graphical criteria of regression graphs on graphs for
which the corresponding criteria are in general more complex. Under the known
conditions that a Markov equivalent directed acyclic graph exists for any given
regression graph, we give a polynomial time algorithm to find one such graph. Comment: 43 pages with 17 figures. The manuscript is to appear as an invited
discussion paper in the journal TES
Graphical Markov models, unifying results and their interpretation
Graphical Markov models combine conditional independence constraints with
graphical representations of stepwise data generating processes. The models
started to be formulated about 40 years ago and vigorous development is
ongoing. Longitudinal observational studies as well as intervention studies are
best modeled via a subclass called regression graph models and, especially,
traceable regressions. Regression graphs include two types of undirected graph
and directed acyclic graphs in ordered sequences of joint responses. Response
components may correspond to discrete or continuous random variables and may
depend exclusively on variables which have been generated earlier. These
aspects are essential when causal hypotheses are the motivation for the
planning of empirical studies.
To turn the graphs into useful tools for tracing developmental pathways and
for predicting structure in alternative models, the generated distributions
have to mimic some properties of joint Gaussian distributions. Here, relevant
results concerning these aspects are spelled out and illustrated by examples.
With regression graph models, it becomes feasible, for the first time, to
derive structural effects of (1) ignoring some of the variables, of (2)
selecting subpopulations via fixed levels of some other variables or of (3)
changing the order in which the variables might get generated. Thus, the most
important future applications of these models will aim at the best possible
integration of knowledge from related studies. Comment: 34 pages, 11 figures, 1 table
A survey of statistical network models
Networks are ubiquitous in science and have become a focal point for
discussion in everyday life. Formal statistical models for the analysis of
network data have emerged as a major topic of interest in diverse areas of
study, and most of these involve a form of graphical representation.
Probability models on graphs date back to 1959. Along with empirical studies in
social psychology and sociology from the 1960s, these early works generated an
active network community and a substantial literature in the 1970s. This effort
moved into the statistical literature in the late 1970s and 1980s, and the past
decade has seen a burgeoning network literature in statistical physics and
computer science. The growth of the World Wide Web and the emergence of online
networking communities such as Facebook, MySpace, and LinkedIn, and a host of
more specialized professional network communities has intensified interest in
the study of networks and network data. Our goal in this review is to provide
the reader with an entry point to this burgeoning literature. We begin with an
overview of the historical development of statistical network modeling and then
we introduce a number of examples that have been studied in the network
literature. Our subsequent discussion focuses on a number of prominent static
and dynamic network models and their interconnections. We emphasize formal
model descriptions, and pay special attention to the interpretation of
parameters and their estimation. We end with a description of some open
problems and challenges for machine learning and statistics. Comment: 96 pages, 14 figures, 333 references
Addressing the unmet need for visualizing Conditional Random Fields in Biological Data
Background: The biological world is replete with phenomena that appear to be
ideally modeled and analyzed by one archetypal statistical framework - the
Graphical Probabilistic Model (GPM). The structure of GPMs is a uniquely good
match for biological problems that range from aligning sequences to modeling
the genome-to-phenome relationship. The fundamental questions that GPMs address
involve making decisions based on a complex web of interacting factors.
Unfortunately, while GPMs ideally fit many questions in biology, they are not
an easy solution to apply. Building a GPM is not a simple task for an end user.
Moreover, applying GPMs is also impeded by the insidious fact that the complex
web of interacting factors inherent to a problem might be easy to define and
also intractable to compute upon. Discussion: We propose that the visualization
sciences can contribute to many domains of the bio-sciences, by developing
tools to address archetypal representation and user interaction issues in GPMs,
and in particular a variety of GPM called a Conditional Random Field (CRF). CRFs
bring additional power, and additional complexity, because the CRF dependency
network can be conditioned on the query data. Conclusions: In this manuscript
we examine the shared features of several biological problems that are amenable
to modeling with CRFs, highlight the challenges that existing visualization and
visual analytics paradigms induce for these data, and document an experimental
solution called StickWRLD which, while leaving room for improvement, has been
successfully applied in several biological research projects. Comment: BioVis 2014 conference
Algebraic Statistics in Practice: Applications to Networks
Algebraic statistics uses tools from algebra (especially from multilinear
algebra, commutative algebra and computational algebra), geometry and
combinatorics to provide insight into knotty problems in mathematical
statistics. In this survey we illustrate this on three problems related to
networks, namely network models for relational data, causal structure discovery
and phylogenetics. For each problem we give an overview of recent results in
algebraic statistics with emphasis on the statistical achievements made
possible by these tools and their practical relevance for applications to other
scientific disciplines.
Transcription regulation: models for combinatorial regulation and functional specificity
Gene regulation is controlled by transcription factor proteins that bind to specific DNA sequences, known as transcription factor binding sites (TFBSs). Combinations of transcription factors, working co-operatively in cis-regulatory modules (CRMs), play a role in regulating gene expression. Current computational methods for TFBS prediction cannot distinguish between functional and non-functional sites, and predict very large numbers of false positives.
The thesis focuses on the development of a novel computational model, based on artificial neural networks (ANNs), for the identification of functional TFBSs, and the CRMs within which they operate in the human genome. Datasets of 12,239 experimentally verified true positive (TP) TFBSs and 130,199 false positive (FP) TFBSs were extracted using a combination of position weight matrices from the JASPAR database and experimentally verified sites from the Encyclopedia of DNA Elements (ENCODE). A number of machine learning algorithms were tested using a range of genetic information including gene expression, nucleosome positioning, DNA methylation states and DNA entropy. The best model, which gave a mean area under the receiver operating characteristic curve of 0.800, was based on a feedforward ANN using backpropagation.
This model was then used to predict functional TFBSs in a number of gene sets from the human genome. The predictions, combined with experimentally proven TFBSs from ENCODE, were used to investigate combinatorial patterns of TFBSs operating in CRMs. CRM patterns have been analysed in disease-associated genes located in linkage disequilibrium blocks containing SNPs obtained from Genome Wide Association Studies (GWAS).
The potential for the model to make functional TFBS predictions to aid in the annotation of orphan genes of unknown function is discussed. In addition, this thesis presents computational work on a number of smaller published studies.