Deep Tree Models for 'Big' Biological Data
The identification of useful temporal dependence structure in discrete time series data is an important component of algorithms applied to many tasks in statistical inference and machine learning, and used in a wide variety of problems across the spectrum of biological studies. Most of the early statistical approaches were ineffective in practice, because the amount of data required for reliable modelling grew exponentially with memory length. On the other hand, many of the more modern methodological approaches that make use of more flexible and parsimonious models result in algorithms that do not scale well and are computationally inefficient for larger data sets. In this paper we describe a class of novel methodological tools for effective Bayesian inference for general discrete time series, motivated primarily by questions regarding data originating from studies in genetics and neuroscience. Our starting point is the development of a rich class of Bayesian hierarchical models for variable-memory Markov chains. The particular prior structure we adopt makes it possible to design effective, linear-time algorithms that can compute most of the important features of the relevant posterior and predictive distributions without resorting to Markov chain Monte Carlo simulation. The origin of some of these algorithms can be traced to the family of Context Tree Weighting (CTW) algorithms developed for data compression since the mid-1990s. We have used the resulting methodological tools in numerous application-specific tasks (including prediction, segmentation, classification, anomaly detection, entropy estimation, and causality testing) on data from different areas of application. The results obtained compare quite favourably with those obtained using earlier approaches, such as Probabilistic Suffix Trees (PST), Variable-Length Markov Chains (VLMC), and the class of Markov Transition Distributions (MTD).
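The variable-memory idea can be illustrated with a minimal sketch (not the paper's algorithm): count next-symbol occurrences for every context up to a maximum depth, then predict with a Krichevsky-Trofimov-smoothed estimate from the longest context actually observed, backing off to shorter contexts. The function names, the back-off rule, and the toy data are all illustrative assumptions.

```python
from collections import defaultdict

def fit_counts(seq, max_depth):
    """Count next-symbol occurrences for every context up to max_depth.

    A variable-memory model keeps only the contexts that matter; here we
    store all of them and let the predictor back off to shorter ones."""
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(len(seq)):
        for d in range(max_depth + 1):
            if i - d < 0:
                break
            counts[seq[i - d:i]][seq[i]] += 1
    return counts

def predict(counts, context, symbol, alphabet, max_depth):
    """KT-smoothed probability of `symbol`, using the longest suffix of
    `context` with observed data (down to the empty context)."""
    for d in range(min(max_depth, len(context)), -1, -1):
        ctx = context[len(context) - d:]
        if ctx in counts:
            c = counts[ctx]
            total = sum(c.values())
            return (c[symbol] + 0.5) / (total + 0.5 * len(alphabet))
    return 1.0 / len(alphabet)

counts = fit_counts("abracadabra", max_depth=3)
p = predict(counts, "abr", "a", alphabet="abcdr", max_depth=3)
```

In "abracadabra" the context "abr" is always followed by "a", so the smoothed estimate p is well above the uniform rate 1/5.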
Estimating the Directed Information and Testing for Causality
The problem of estimating the directed information rate between two discrete processes (Xn) and (Yn) via the plug-in (or maximum-likelihood) estimator is considered. When the joint process ((Xn, Yn)) is a Markov chain of a given memory length, the plug-in estimator is shown to be asymptotically Gaussian and to converge at the optimal rate O(1/√n) under appropriate conditions; this is the first estimator that has been shown to achieve this rate. An important connection is drawn between the problem of estimating the directed information rate and that of performing a hypothesis test for the presence of causal influence between the two processes. Under fairly general conditions, the null hypothesis, which corresponds to the absence of causal influence, is equivalent to the requirement that the directed information rate be equal to zero. In that case, a finer result is established, showing that the plug-in estimator converges at the faster rate O(1/n) and that it is asymptotically χ2-distributed. This is proved by showing that this estimator is equal to (a scalar multiple of) the classical likelihood ratio statistic for the above hypothesis test. Finally, it is noted that these results facilitate the design of an actual likelihood ratio test for the presence or absence of causal influence.
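For the simplest order-1 Markov case, the plug-in estimator reduces to an empirical conditional mutual information. The sketch below computes I(X_t; Y_t | Y_{t-1}) in bits from the empirical counts; it is an illustrative special case under an assumed memory length of one, not the paper's general construction, and the function name is hypothetical.

```python
import math
from collections import Counter

def plugin_cmi(x, y):
    """Plug-in (maximum-likelihood) estimate of I(X_t ; Y_t | Y_{t-1}) in bits.

    Under an order-1 joint Markov assumption this empirical conditional
    mutual information is a simple form of the directed-information rate:
    it is zero exactly when the fitted model shows no influence of X on Y.
    (2 * n * ln(2) times this value is the log-likelihood ratio statistic
    for the no-influence null, which is asymptotically chi-squared.)"""
    n = len(y) - 1  # number of observed transitions
    joint = Counter((y[t - 1], x[t], y[t]) for t in range(1, len(y)))
    yprev_y, yprev_x, yprev = Counter(), Counter(), Counter()
    for (yp, xt, yt), c in joint.items():
        yprev_y[(yp, yt)] += c
        yprev_x[(yp, xt)] += c
        yprev[yp] += c
    # I(X;Y|Z) = sum_z,x,y p(z,x,y) log[ p(z,x,y) p(z) / (p(z,y) p(z,x)) ]
    return sum((c / n) * math.log2(c * yprev[yp]
                                   / (yprev_y[(yp, yt)] * yprev_x[(yp, xt)]))
               for (yp, xt, yt), c in joint.items())

no_influence = plugin_cmi("aaaa", "0101")       # constant X cannot influence Y
influence = plugin_cmi("00110100", "00110100")  # Y copies X exactly
```

A constant X sequence yields exactly zero, while a Y that copies X yields a strictly positive estimate, matching the equivalence between zero directed information and the no-causal-influence null described above.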
Topic modelling of authentication events in an enterprise computer network
The possibility for theft or misuse of legitimate user credentials is a potential cyber-security weakness in any enterprise computer network which is almost impossible to eradicate. However, by monitoring the network traffic patterns, it may be possible to detect misuse of credentials. This article presents an initial investigation into deconvolving the mixture behaviour of several individuals within a network, to see if individual users can be identified. To that end, a technique originally developed for modelling document collections, the latent Dirichlet allocation (LDA) model, is deployed. A pilot study is conducted on authentication events taken from real data from the enterprise network of Los Alamos National Laboratory.
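To make the approach concrete, a minimal collapsed Gibbs sampler for LDA can recover distinct behaviour profiles from per-user event sequences. The data below are purely synthetic stand-ins for authentication logs (the Los Alamos data are not reproduced here), and the function name and hyperparameters are illustrative choices, not the article's.

```python
import numpy as np

def lda_gibbs(docs, n_topics, n_vocab, iters=200, alpha=0.1, beta=0.1, seed=0):
    """Minimal collapsed Gibbs sampler for latent Dirichlet allocation.

    `docs` is a list of token-id lists (e.g. one list of event-type ids per
    user session). Returns per-document topic mixtures."""
    rng = np.random.default_rng(seed)
    z = [rng.integers(n_topics, size=len(d)) for d in docs]  # topic of each token
    ndk = np.zeros((len(docs), n_topics)) + alpha  # doc-topic counts (smoothed)
    nkw = np.zeros((n_topics, n_vocab)) + beta     # topic-word counts (smoothed)
    nk = np.zeros(n_topics) + beta * n_vocab       # topic totals (smoothed)
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            ndk[d, z[d][i]] += 1; nkw[z[d][i], w] += 1; nk[z[d][i]] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]  # remove current token, resample its topic
                ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
                p = ndk[d] * nkw[:, w] / nk  # collapsed conditional (unnormalised)
                k = rng.choice(n_topics, p=p / p.sum())
                z[d][i] = k
                ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    return ndk / ndk.sum(axis=1, keepdims=True)

# Two archetypal behaviours: "users" 0-4 emit event types {0, 1},
# "users" 5-9 emit event types {2, 3}.
docs = [[0, 1, 0, 1, 1, 0]] * 5 + [[2, 3, 2, 3, 3, 2]] * 5
theta = lda_gibbs(docs, n_topics=2, n_vocab=4)
```

Each row of `theta` is one user's mixture over latent behaviour profiles; deconvolving mixed traffic then amounts to reading off which profile dominates each session.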
Revisiting Context-Tree Weighting for Bayesian Inference
We revisit the statistical foundation of the celebrated context tree weighting (CTW) algorithm, and we develop a Bayesian modelling framework for the class of higher-order, variable-memory Markov chains, along with an associated collection of methodological tools for exact inference for discrete time series. In addition to deterministic algorithms that learn the a posteriori most likely models and compute their posterior probabilities, we introduce a family of variable-dimension Markov chain Monte Carlo samplers, facilitating further exploration of the posterior. The performance of the proposed methods in model selection, Markov order estimation and prediction is illustrated through simulation experiments and real-world applications.
Bayesian context trees: Modelling and exact inference for discrete time series
We develop a new Bayesian modelling framework for the class of higher-order, variable-memory Markov chains, and introduce an associated collection of methodological tools for exact inference with discrete time series. We show that a version of the context tree weighting algorithm can compute the prior predictive likelihood exactly (averaged over both models and parameters), and two related algorithms are introduced, which identify the a posteriori most likely models and compute their exact posterior probabilities. All three algorithms are deterministic and have linear-time complexity. A family of variable-dimension Markov chain Monte Carlo samplers is also provided, facilitating further exploration of the posterior. The performance of the proposed methods in model selection, Markov order estimation and prediction is illustrated through simulation experiments and real-world applications with data from finance, genetics, neuroscience, and animal communication. The associated algorithms are implemented in the R package BCT.
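The core computation can be sketched for the binary case: the CTW recursion P_w(s) = ½ P_e(s) + ½ P_w(0s) P_w(1s) mixes the Krichevsky-Trofimov marginal likelihood P_e at each node with the product over its children, yielding the prior predictive likelihood averaged over all tree models and parameters. The sketch below (hypothetical function name, binary alphabet only, first `depth` symbols treated as a fixed initial context) illustrates the recursion; it is not the BCT package itself.

```python
import math
from collections import defaultdict

def ctw_log_likelihood(bits, depth):
    """Log prior predictive likelihood of a binary string under the CTW
    mixture over all context-tree models up to `depth` (KT parameter
    prior, 1/2-1/2 model weights at each internal node)."""
    # counts[ctx] = [zeros, ones] observed immediately after context ctx
    counts = defaultdict(lambda: [0, 0])
    for i in range(depth, len(bits)):
        for d in range(depth + 1):
            counts[bits[i - d:i]][int(bits[i])] += 1

    def log_kt(a, b):
        # Krichevsky-Trofimov log-probability of a zeros and b ones
        s = 0.0
        for i in range(a):
            s += math.log(i + 0.5)
        for j in range(b):
            s += math.log(j + 0.5)
        for k in range(a + b):
            s -= math.log(k + 1)
        return s

    def log_pw(ctx):
        a, b = counts[ctx]
        le = log_kt(a, b)
        if len(ctx) == depth:  # leaf: no further splitting allowed
            return le
        lc = log_pw("0" + ctx) + log_pw("1" + ctx)  # children extend the past
        # log(0.5*exp(le) + 0.5*exp(lc)), computed stably in log space
        m = max(le, lc)
        return math.log(0.5) + m + math.log(math.exp(le - m) + math.exp(lc - m))

    return log_pw("")

lp = ctw_log_likelihood("0" * 20, depth=2)
```

The recursion visits each node once, which is the source of the linear-time complexity claimed above; a highly predictable string such as all zeros receives a log-likelihood close to zero.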