Using Markov Models and Statistics to Learn, Extract, Fuse, and Detect Patterns in Raw Data
Many systems are partially stochastic in nature. We have derived data driven
approaches for extracting stochastic state machines (Markov models) directly
from observed data. This chapter provides an overview of our approach with
numerous practical applications. We have used this approach for inferring
shipping patterns, exploiting computer system side-channel information, and
detecting botnet activities. For contrast, we include a related data-driven
statistical inferencing approach that detects and localizes radiation sources.Comment: Accepted by 2017 International Symposium on Sensor Networks, Systems
and Securit
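The core step described above, extracting a Markov model directly from observed data, can be sketched for the simplest first-order case as follows. This is an illustrative minimal version, not the chapter's actual method; the function name and toy sequence are assumptions:

```python
from collections import Counter, defaultdict

def estimate_markov_model(sequence):
    """Estimate first-order Markov transition probabilities
    from an observed symbol sequence (maximum likelihood counts)."""
    counts = defaultdict(Counter)
    for cur, nxt in zip(sequence, sequence[1:]):
        counts[cur][nxt] += 1
    model = {}
    for state, nxt_counts in counts.items():
        total = sum(nxt_counts.values())
        model[state] = {s: c / total for s, c in nxt_counts.items()}
    return model

# Toy observation sequence; real applications would use, e.g.,
# discretized shipping-track or side-channel event symbols.
obs = list("ABABABBABA")
model = estimate_markov_model(obs)
```

Each row of `model` is the empirical conditional distribution over next symbols, which is the maximum likelihood estimate of the chain's transition matrix.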
Rooting a phylogenetic tree with nonreversible substitution models
BACKGROUND: We compared two methods of rooting a phylogenetic tree: the stationary and the nonstationary substitution processes. These methods do not require an outgroup.
METHODS: Given a multiple alignment and an unrooted tree, the maximum likelihood estimates of branch lengths and substitution parameters for each associated rooted tree are found; rooted trees are compared using their likelihood values. Site variation in substitution rates is handled by assigning sites into several classes before the analysis.
RESULTS: In three test datasets where the trees are small and the roots are assumed known, the nonstationary process gets the correct estimate significantly more often, and fits the data much better, than the stationary process. Both processes give biologically plausible root placements in a set of nine primate mitochondrial DNA sequences.
CONCLUSIONS: The nonstationary process is simple to use and is much better than the stationary process at inferring the root. It could be useful for situations where an outgroup is unavailable.
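The likelihood computations underlying such methods require substitution probabilities along branches. A minimal sketch for a two-state continuous-time chain follows; this is not the substitution model used in the study (which works on nucleotide data), just an illustration of the kind of per-branch transition matrix the likelihood is built from. Root information in the nonstationary setting comes from allowing the root distribution to differ from the stationary one:

```python
import math

def two_state_transition(a, b, t):
    """Transition matrix P(t) for a two-state continuous-time Markov chain
    with rate a for 0 -> 1 and rate b for 1 -> 0, over branch length t.
    Closed form: p01(t) = (a / (a + b)) * (1 - exp(-(a + b) t))."""
    s = a + b
    decay = math.exp(-s * t)
    p01 = (a / s) * (1 - decay)
    p10 = (b / s) * (1 - decay)
    return [[1 - p01, p01], [p10, 1 - p10]]
```

Given such matrices for every branch, the likelihood of a rooted tree is obtained by summing over latent ancestral states (Felsenstein's pruning), and candidate root placements are then ranked by their maximized likelihoods, as the abstract describes.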
The Computational Structure of Spike Trains
Neurons perform computations, and convey the results of those computations
through the statistical structure of their output spike trains. Here we present
a practical method, grounded in the information-theoretic analysis of
prediction, for inferring a minimal representation of that structure and for
characterizing its complexity. Starting from spike trains, our approach finds
their causal state models (CSMs), the minimal hidden Markov models or
stochastic automata capable of generating statistically identical time series.
We then use these CSMs to objectively quantify both the generalizable structure
and the idiosyncratic randomness of the spike train. Specifically, we show that
the expected algorithmic information content (the information needed to
describe the spike train exactly) can be split into three parts describing (1)
the time-invariant structure (complexity) of the minimal spike-generating
process, which describes the spike train statistically; (2) the randomness
(internal entropy rate) of the minimal spike-generating process; and (3) a
residual pure noise term not described by the minimal spike-generating process.
We use CSMs to approximate each of these quantities. The CSMs are inferred
nonparametrically from the data, making only mild regularity assumptions, via
the causal state splitting reconstruction algorithm. The methods presented here
complement more traditional spike train analyses by describing not only spiking
probability and spike train entropy, but also the complexity of a spike train's
structure. We demonstrate our approach using both simulated spike trains and
experimental data recorded in rat barrel cortex during vibrissa stimulation.
Comment: Somewhat different format from journal version but same content
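One ingredient of the decomposition above, the entropy rate of the spike-generating process, can be approximated directly from data via block entropies. This is a standard finite-length estimator, not the CSM-based quantity the paper computes; the function names are illustrative:

```python
import math
from collections import Counter

def block_entropy(spikes, k):
    """Empirical Shannon entropy (bits) of length-k blocks of a binary spike train."""
    blocks = [tuple(spikes[i:i + k]) for i in range(len(spikes) - k + 1)]
    n = len(blocks)
    counts = Counter(blocks)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def entropy_rate_estimate(spikes, k):
    """Finite-k entropy-rate approximation H(k) - H(k-1),
    which converges to the true rate as k grows (given enough data)."""
    return block_entropy(spikes, k) - block_entropy(spikes, k - 1)
```

For a perfectly periodic train the estimate is near zero (all structure, no randomness), while for an i.i.d. train it approaches the per-bin entropy; the CSM approach refines this by also quantifying the complexity of the structure itself.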
A hierarchical Bayesian model for inference of copy number variants and their association to gene expression
A number of statistical models have been successfully developed for the
analysis of high-throughput data from a single source, but few methods are
available for integrating data from different sources. Here we focus on
integrating gene expression levels with comparative genomic hybridization (CGH)
array measurements collected on the same subjects. We specify a measurement
error model that relates the gene expression levels to latent copy number
states which, in turn, are related to the observed surrogate CGH measurements
via a hidden Markov model. We employ selection priors that exploit the
dependencies across adjacent copy number states and investigate MCMC stochastic
search techniques for posterior inference. Our approach results in a unified
modeling framework for simultaneously inferring copy number variants (CNV) and
identifying their significant associations with mRNA transcripts abundance. We
show performance on simulated data and illustrate an application to data from a
genomic study on human cancer cell lines.
Comment: Published at http://dx.doi.org/10.1214/13-AOAS705 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org)
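The hidden Markov layer relating observed CGH measurements to latent copy number states can be decoded with standard HMM machinery. A minimal Viterbi sketch follows; it is a generic most-probable-path decoder, not the paper's MCMC-based posterior inference, and all inputs here are illustrative:

```python
import math

def viterbi(obs_loglik, log_trans, log_init):
    """Most probable latent state path for an HMM, given per-position
    observation log-likelihoods obs_loglik[t][s], transition log-probabilities
    log_trans[r][s], and initial log-probabilities log_init[s]."""
    n, k = len(obs_loglik), len(log_init)
    delta = [log_init[s] + obs_loglik[0][s] for s in range(k)]
    back = []
    for t in range(1, n):
        ptr, new = [], []
        for s in range(k):
            best = max(range(k), key=lambda r: delta[r] + log_trans[r][s])
            ptr.append(best)
            new.append(delta[best] + log_trans[best][s] + obs_loglik[t][s])
        delta = new
        back.append(ptr)
    path = [max(range(k), key=lambda s: delta[s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]
```

In the CNV setting the states would be copy number levels (e.g. loss, neutral, gain), with sticky transitions encoding the dependence across adjacent genomic positions that the paper's selection priors exploit.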
Causal inference using the algorithmic Markov condition
Inferring the causal structure that links n observables is usually based upon
detecting statistical dependences and choosing simple graphs that make the
joint measure Markovian. Here we argue why causal inference is also possible
when only single observations are present.
We develop a theory of how to generate causal graphs explaining similarities
between single objects. To this end, we replace the notion of conditional
stochastic independence in the causal Markov condition with the vanishing of
conditional algorithmic mutual information and describe the corresponding
causal inference rules.
We explain why a consistent reformulation of causal inference in terms of
algorithmic complexity implies a new inference principle that takes into
account also the complexity of conditional probability densities, making it
possible to select among Markov equivalent causal graphs. This insight provides
a theoretical foundation of a heuristic principle proposed in earlier work.
We also discuss how to replace Kolmogorov complexity with decidable
complexity criteria. This can be seen as an algorithmic analog of replacing the
empirically undecidable question of statistical independence with practical
independence tests that are based on implicit or explicit assumptions on the
underlying distribution.
Comment: 16 figures
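A common decidable stand-in for Kolmogorov complexity, in the spirit of the replacement discussed above, is compressed length, which yields the normalized compression distance as a practical similarity measure between single objects. A minimal sketch (the choice of `zlib` and the helper names are assumptions, not from the paper):

```python
import zlib

def comp_len(x: bytes) -> int:
    """Compressed length: a computable upper-bound proxy for Kolmogorov complexity."""
    return len(zlib.compress(x, 9))

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance: small when one object
    helps compress the other, i.e. when they share structure."""
    cx, cy, cxy = comp_len(x), comp_len(y), comp_len(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)
```

Two copies of the same object yield a distance near zero, while structurally unrelated objects yield a larger distance; such compression-based estimates make algorithmic-information criteria usable in practice, at the cost of depending on the chosen compressor.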