Coding of non-stationary sources as a foundation for detecting change points and outliers in binary time-series
An interesting scheme for estimating and adapting distributions in real time for non-stationary data has recently been studied for several tasks in time-series analysis and data mining, namely change point detection, outlier detection, and online compression/sequence prediction. An appealing feature is that, unlike more sophisticated procedures, it is as fast as the related stationary procedures, which are simply modified through discounting or windowing. The discount scheme makes older observations lose their influence on new predictions. The authors of this article recently used a discount scheme to introduce an adaptive version of the Context Tree Weighting compression algorithm. The change point and outlier detection methods mentioned above rely on the changing compression ratio of an online compression algorithm. Here we begin to provide theoretical foundations for the use of these adaptive estimation procedures, which have already shown practical promise.
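As a concrete illustration of the kind of discounted estimator discussed above, the following is a minimal sketch using a Krichevsky-Trofimov style predictor over a binary alphabet; the class name and the discount factor gamma are illustrative assumptions, not details taken from the article.

```python
# Minimal sketch of a discounted, KT-style estimator for a binary source.
# The class name and the discount factor `gamma` are illustrative choices,
# not details taken from the article.

class DiscountedKT:
    def __init__(self, gamma: float = 0.99):
        self.gamma = gamma            # discount applied to old counts
        self.counts = [0.0, 0.0]      # (discounted) counts of symbols 0 and 1

    def predict(self, symbol: int) -> float:
        # KT predictor: add 1/2 to each (discounted) count.
        total = self.counts[0] + self.counts[1]
        return (self.counts[symbol] + 0.5) / (total + 1.0)

    def update(self, symbol: int) -> None:
        # Discounting makes older observations lose their influence on new
        # predictions, so the estimate can track a non-stationary source.
        self.counts[0] *= self.gamma
        self.counts[1] *= self.gamma
        self.counts[symbol] += 1.0

model = DiscountedKT(gamma=0.95)
for bit in [0, 0, 1, 0, 1, 1, 1, 1]:      # a source drifting towards ones
    p = model.predict(bit)                # probability assigned before seeing bit
    model.update(bit)
```

Because old counts are geometrically down-weighted, the predictor tracks a drifting source instead of converging to the long-run average, which is the behaviour the adaptive CTW variant exploits.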
A Universal Parallel Two-Pass MDL Context Tree Compression Algorithm
Computing problems that handle large amounts of data necessitate the use of lossless data compression for efficient storage and transmission. We present a novel lossless universal data compression algorithm that uses parallel computational units to increase the throughput. The length-N input sequence is partitioned into B blocks. Processing each block independently of the other blocks can accelerate the computation by a factor of B, but degrades the compression quality. Instead, our approach is to first estimate the minimum description length (MDL) context tree source underlying the entire input, and then encode each of the B blocks in parallel based on the MDL source. With this two-pass approach, the compression loss incurred by using more parallel units is insignificant. Our algorithm is work-efficient. Its redundancy above Rissanen's lower bound on universal compression performance is small with respect to any context tree source of bounded maximal depth. We improve the compression by using different quantizers for states of the context tree based on the number of symbols corresponding to those states. Numerical results from a prototype implementation suggest that our algorithm offers a better trade-off between compression and throughput than competing universal data compression algorithms.

Comment: Accepted to Journal of Selected Topics in Signal Processing, special issue on Signal Processing for Big Data (expected publication date June 2015). 10 pages double column, 6 figures, and 2 tables. arXiv admin note: substantial text overlap with arXiv:1405.6322. Version: Mar 2015: Corrected a typo.
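The two-pass structure described above can be illustrated with a small sketch: the first pass fits a single shared model to the entire input, and the second pass handles each block independently against that model. Here a fixed-order Markov model with add-one smoothing stands in for the MDL context tree source, and ideal code lengths stand in for parallel arithmetic coding, so this is only an illustration of the structure under those assumptions.

```python
# Sketch of the two-pass idea: pass 1 fits one shared model to the whole
# input, pass 2 scores each block independently using that model.  A
# fixed-order Markov model with add-one smoothing stands in for the MDL
# context tree source; ideal code lengths stand in for arithmetic coding.
import math
from collections import defaultdict

def fit_model(data: bytes, order: int = 1):
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(order, len(data)):
        counts[data[i - order:i]][data[i]] += 1
    return counts

def block_code_length(block: bytes, counts, order: int = 1) -> float:
    bits = 8.0 * order                      # send the first symbols literally
    for i in range(order, len(block)):
        ctx_counts = counts.get(block[i - order:i], {})
        total = sum(ctx_counts.values())
        p = (ctx_counts.get(block[i], 0) + 1) / (total + 256)
        bits += -math.log2(p)
    return bits

data = b"abracadabra abracadabra abracadabra abracadabra"
B = 4
size = -(-len(data) // B)                   # ceil(len(data) / B)
blocks = [data[i:i + size] for i in range(0, len(data), size)]

model = fit_model(data)                                    # pass 1: whole input
lengths = [block_code_length(b, model) for b in blocks]    # pass 2: per block
print(sum(lengths), "bits for", len(data), "bytes")
```

Because every block is coded against the same model fitted on the whole input, the per-block scoring step is embarrassingly parallel while the compression quality stays close to that of a single sequential pass.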
Universal Estimation of Directed Information
Four estimators of the directed information rate between a pair of jointly stationary ergodic finite-alphabet processes are proposed, based on universal probability assignments. The first one is a Shannon–McMillan–Breiman type estimator, similar to those used by Verdú (2005) and Cai, Kulkarni, and Verdú (2006) for estimation of other information measures. We show the almost sure and $L_1$ convergence properties of the estimator for any underlying universal probability assignment. The other three estimators map universal probability assignments to different functionals, each exhibiting relative merits such as smoothness, nonnegativity, and boundedness. We establish the consistency of these estimators in the almost sure and $L_1$ senses, and derive near-optimal rates of convergence in the minimax sense under mild conditions. These estimators carry over directly to estimating other information measures of stationary ergodic finite-alphabet processes, such as entropy rate and mutual information rate, with near-optimal performance, and provide alternatives to classical approaches in the existing literature. Guided by these theoretical results, the proposed estimators are implemented using the context-tree weighting algorithm as the universal probability assignment. Experiments on synthetic and real data are presented, demonstrating the potential of the proposed schemes in practice and the utility of directed information estimation in detecting and measuring causal influence and delay.

Comment: 23 pages, 10 figures, to appear in IEEE Transactions on Information Theory.
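One of the plug-in constructions suggested by this abstract can be sketched as follows: average the log-ratio of two sequential probability assignments, one conditioned on Y's past alone and one also conditioned on X. Order-1 Laplace-smoothed predictors stand in here for the CTW probability assignment used in the paper, so this only illustrates the estimator's structure, not the paper's algorithm.

```python
# Plug-in sketch: directed information rate estimated as the average log-ratio
# of two sequential predictors.  Laplace-smoothed order-1 predictors stand in
# for the universal (CTW) probability assignment used in the paper.
import math
import random
from collections import defaultdict

class SequentialPredictor:
    """Laplace-smoothed predictor of the next symbol given a context key."""
    def __init__(self, alphabet_size: int):
        self.k = alphabet_size
        self.counts = defaultdict(lambda: defaultdict(int))

    def prob(self, symbol, context) -> float:
        ctx = self.counts[context]
        return (ctx[symbol] + 1) / (sum(ctx.values()) + self.k)

    def update(self, symbol, context) -> None:
        self.counts[context][symbol] += 1

def directed_information_rate(xs, ys, alphabet_size: int = 2) -> float:
    q_y  = SequentialPredictor(alphabet_size)   # Q(y_i | y_{i-1})
    q_yx = SequentialPredictor(alphabet_size)   # Q(y_i | y_{i-1}, x_{i-1}, x_i)
    total = 0.0
    for i in range(1, len(ys)):
        c_y  = ys[i - 1]
        c_yx = (ys[i - 1], xs[i - 1], xs[i])
        total += math.log2(q_yx.prob(ys[i], c_yx) / q_y.prob(ys[i], c_y))
        q_y.update(ys[i], c_y)
        q_yx.update(ys[i], c_yx)
    return total / (len(ys) - 1)

# X drives Y with a one-step delay, so the estimate should be clearly positive.
random.seed(0)
xs = [random.randint(0, 1) for _ in range(5000)]
ys = [0] + [x if random.random() < 0.9 else 1 - x for x in xs[:-1]]
print(directed_information_rate(xs, ys))
```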
Data Discovery and Anomaly Detection Using Atypicality: Theory
A central question in the era of 'big data' is what to do with the enormous amount of information. One possibility is to characterize it through statistics, e.g., averages, or to classify it using machine learning, in order to understand the general structure of the overall data. The perspective in this paper is the opposite, namely that in some applications most of the value of the information is in the parts that deviate from the average, the parts that are unusual, atypical. We define what we mean by 'atypical' in an axiomatic way, as data that can be encoded with fewer bits on its own than by using the code for the typical data. We show that this definition has good theoretical properties. We then develop an implementation based on universal source coding, and apply this to a number of real-world data sets.

Comment: 40 pages.
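The coding criterion stated above can be illustrated with a toy sketch: a segment is flagged when coding it "in itself" (here, with a KT code for the segment alone, plus a small header) is cheaper than coding it with the model of the typical data. The typical-data model (an i.i.d. Bernoulli source) and the header cost are placeholder assumptions, not the paper's construction.

```python
# Toy illustration of the atypicality criterion: compare the cost of coding a
# segment on its own (KT code plus a small header) with its cost under the
# typical-data model.  Model and header cost are placeholders.
import math

def kt_code_length(bits) -> float:
    """Code length (bits) of a binary string under the KT estimator."""
    a = b = 0.5                 # counts of 0s and 1s, each plus 1/2
    total = 0.0
    for x in bits:
        p = (a if x == 0 else b) / (a + b)
        total += -math.log2(p)
        if x == 0:
            a += 1.0
        else:
            b += 1.0
    return total

def typical_code_length(bits, p1: float = 0.1) -> float:
    """Code length under the (known) typical-data model: i.i.d. Bernoulli(p1)."""
    return sum(-math.log2(p1 if x == 1 else 1.0 - p1) for x in bits)

def is_atypical(segment, header_bits: float = 8.0) -> bool:
    return kt_code_length(segment) + header_bits < typical_code_length(segment)

typical_segment = [0] * 18 + [1] * 2     # consistent with Bernoulli(0.1)
unusual_segment = [1] * 20               # very unlikely under the typical model
print(is_atypical(typical_segment), is_atypical(unusual_segment))
```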
Large-alphabet sequence modelling - a comparative study
Most raw data is not binary, but is drawn from an often large and structured alphabet. It is sometimes convenient to work with a binarised data sequence, but exploiting the original structure of the data typically improves performance significantly in many practical applications. In this thesis, we study Martin-Löf random sequences, which are maximally incompressible, and provide a topological view on the size of the set of random sequences. We also investigate the relationship between binary data compression techniques and modelling natural language text, with the latter using the raw, unbinarised data sequence over a large alphabet. We perform an experimental comparative study, including an empirical comparison of Kneser-Ney (KN) variants with the regular Context Tree Weighting algorithm (CTW), the phase CTW, and the large-alphabet CTW with different estimators. We also apply the idea of Hutter's adaptive sparse Dirichlet-multinomial coding to the KN method and provide a heuristic that makes the discounting parameter adaptive. KN with this adaptive discounting parameter outperforms the traditional KN method on the Large Calgary corpus.
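To show where such an adaptive discount would enter, here is a minimal interpolated Kneser-Ney bigram sketch with an explicit discount parameter d; the class and the point of adaptation are illustrative assumptions, and the thesis's actual adaptation heuristic is not reproduced.

```python
# Minimal interpolated Kneser-Ney bigram model with an explicit discount d,
# the parameter the thesis proposes to make adaptive.  Illustrative sketch.
from collections import defaultdict

class KneserNeyBigram:
    def __init__(self, d: float = 0.75):
        self.d = d                                    # discount parameter
        self.bigram = defaultdict(lambda: defaultdict(int))
        self.continuation = defaultdict(set)          # word -> contexts it follows
        self.total_bigram_types = 0

    def update(self, prev: str, word: str) -> None:
        if self.bigram[prev][word] == 0:
            self.total_bigram_types += 1
            self.continuation[word].add(prev)
        self.bigram[prev][word] += 1

    def prob(self, prev: str, word: str) -> float:
        ctx = self.bigram[prev]
        ctx_total = sum(ctx.values())
        if ctx_total == 0:
            return 1.0 / max(self.total_bigram_types, 1)
        # Discounted bigram estimate, interpolated with a continuation unigram.
        p_cont = len(self.continuation[word]) / max(self.total_bigram_types, 1)
        lam = self.d * len(ctx) / ctx_total
        return max(ctx[word] - self.d, 0.0) / ctx_total + lam * p_cont

kn = KneserNeyBigram(d=0.75)
words = "the cat sat on the mat the cat sat".split()
for prev, word in zip(words, words[1:]):
    kn.update(prev, word)
print(kn.prob("the", "cat"), kn.prob("the", "mat"))
```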
Top Down Electroweak Dipole Operators
We derive present constraints on, and prospective sensitivity to, the electric dipole moment (EDM) of the top quark implied by searches for the EDMs of the electron and nucleons. Above the electroweak scale, the top-quark EDM arises from two gauge-invariant operators, generated at a higher scale, that also mix with the light-fermion EDMs under renormalization group evolution at two-loop order. Bounds on the EDMs of first-generation fermion systems thus imply bounds on the top-quark EDM. Working in the leading log-squared approximation, we derive the present upper bound on the top-quark EDM, except in regions of finely tuned cancellations that allow it to be up to fifty times larger. Future probes may yield an order of magnitude increase in sensitivity, while inclusion of a prospective proton EDM search may lead to an additional increase in reach.

Comment: 7 pages, 6 figures.
Context tree switching
This paper describes the Context Tree Switching technique, a modification of Context Tree
Weighting for the prediction of binary, stationary, n-Markov sources. By modifying Context
Tree Weighting’s recursive weighting scheme, it is possible to mix over a strictly larger class of
models without increasing the asymptotic time or space complexity of the original algorithm.
We prove that this generalization preserves the desirable theoretical properties of Context Tree
Weighting on stationary n-Markov sources, and show empirically that this new technique leads
to consistent improvements over Context Tree Weighting, as measured on the Calgary Corpus.
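For reference, the recursion being modified is the CTW weighted-probability recursion, in which every node mixes its local KT estimate with the product of its children's weighted probabilities. The batch sketch below computes that mixture for a short binary sequence; a real implementation updates it sequentially, and Context Tree Switching replaces the fixed 1/2-1/2 mixture with a switching prior. The depth, names, and toy sequence are illustrative assumptions.

```python
# Batch sketch of the CTW recursion that Context Tree Switching modifies:
# each node mixes its local KT estimate with the product of its children's
# weighted probabilities.  A real implementation updates this sequentially.
import math
from collections import defaultdict

def kt_block_prob(n0: int, n1: int) -> float:
    """KT probability of any particular string with n0 zeros and n1 ones."""
    p = 1.0
    a = b = 0.5
    for _ in range(n0):
        p *= a / (a + b); a += 1.0
    for _ in range(n1):
        p *= b / (a + b); b += 1.0
    return p

def ctw_prob(seq, depth: int = 3) -> float:
    # For every context s of length <= depth, count the symbols that followed it.
    counts = defaultdict(lambda: [0, 0])
    for i in range(depth, len(seq)):
        for d in range(depth + 1):
            counts[tuple(seq[i - d:i])][seq[i]] += 1

    def weighted(ctx):
        n0, n1 = counts[ctx]
        pe = kt_block_prob(n0, n1)
        if len(ctx) == depth:
            return pe
        # CTW mixes "stop here" with "split on one more context symbol";
        # CTS would replace this fixed 1/2-1/2 mix with a switching prior.
        return 0.5 * pe + 0.5 * weighted((0,) + ctx) * weighted((1,) + ctx)

    return weighted(())

DEPTH = 3
seq = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
print(math.log2(1.0 / ctw_prob(seq, DEPTH)), "bits for", len(seq) - DEPTH, "symbols")
```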