Rare Events and Conditional Events on Random Strings
Some strings (the texts) are assumed to be randomly generated, according to a probability model that is either a Bernoulli model or a Markov model. A rare event is the over- or under-representation of a word or a set of words. The aim of this paper is twofold. First, a single word is given. One studies the tail distribution of the number of its occurrences. Sharp large deviation estimates are derived. Second, one assumes that a given word is overrepresented. The distribution of a second word is studied; formulae for the expectation and the variance are derived. In both cases, the formulae are accurate and actually computable. These results have applications in computational biology, where a genome is viewed as a text.
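To make the notion of over- or under-representation concrete, here is a minimal Python sketch, assuming the simplest case of a Bernoulli (i.i.d.) text. It counts the occurrences of a word, overlaps included, and computes the expected count (n - m + 1) times the product of the letter probabilities. The function names are hypothetical, and the sketch does not reproduce the paper's large deviation estimates or its Markov model results.

```python
from math import prod
from typing import Mapping


def count_occurrences(text: str, word: str) -> int:
    """Count occurrences of `word` in `text`, overlapping ones included."""
    m = len(word)
    return sum(1 for i in range(len(text) - m + 1) if text[i:i + m] == word)


def expected_occurrences(n: int, word: str, probs: Mapping[str, float]) -> float:
    """Expected number of occurrences of `word` in an i.i.d. text of length n.

    Assumption: each letter is drawn independently with probability probs[letter].
    There are n - m + 1 possible start positions, each matching with
    probability equal to the product of the letter probabilities.
    """
    m = len(word)
    if n < m:
        return 0.0
    return (n - m + 1) * prod(probs[c] for c in word)
```

Comparing the observed count with the expected one gives a first, crude indication of whether a word is over- or under-represented; deciding whether the deviation is significant requires tail estimates such as those the abstract describes.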
Causality - Complexity - Consistency: Can Space-Time Be Based on Logic and Computation?
The difficulty of explaining non-local correlations in a fixed causal
structure sheds new light on the old debate on whether space and time are to be
seen as fundamental. Refraining from assuming space-time as given a priori has
a number of consequences. First, the usual definitions of randomness depend on
a causal structure and turn meaningless. So motivated, we propose an intrinsic,
physically motivated measure for the randomness of a string of bits: its length
minus its normalized work value, a quantity we closely relate to its Kolmogorov
complexity (the length of the shortest program making a universal Turing
machine output this string). We test this alternative concept of randomness for
the example of non-local correlations, and we end up with a reasoning that
leads to similar conclusions as in, but is conceptually more direct than, the
probabilistic view since only the outcomes of measurements that can actually
all be carried out together are put into relation to each other. In the same
context-free spirit, we connect the logical reversibility of an evolution to
the second law of thermodynamics and the arrow of time. Refining this, we end
up with a speculation on the emergence of a space-time structure on bit strings
in terms of data-compressibility relations. Finally, we show that logical
consistency, by which we replace the abandoned causality, is strictly weaker a
constraint than the latter in the multi-party case.
Comment: 17 pages, 16 figures, small correction
On the Vocabulary of Grammar-Based Codes and the Logical Consistency of Texts
The article presents a new interpretation for Zipf-Mandelbrot's law in
natural language which rests on two areas of information theory. Firstly, we
construct a new class of grammar-based codes and, secondly, we investigate
properties of strongly nonergodic stationary processes. The motivation for the
joint discussion is to prove a proposition with a simple informal statement: If
a text of length $n$ describes $n^\beta$ independent facts in a repetitive way
then the text contains at least $n^\beta/\log n$ different words, under
suitable conditions on $n$. In the formal statement, two modeling postulates
are adopted. Firstly, the words are understood as nonterminal symbols of the
shortest grammar-based encoding of the text. Secondly, the text is assumed to
be emitted by a finite-energy strongly nonergodic source whereas the facts are
binary IID variables predictable in a shift-invariant way.
Comment: 24 pages, no figures
Estimating the Algorithmic Complexity of Stock Markets
Randomness and regularities in Finance are usually treated in probabilistic
terms. In this paper, we develop a completely different approach, using a
non-probabilistic framework based on the algorithmic information theory
initially developed by Kolmogorov (1965). We present some elements of this
theory and show why it is particularly relevant to Finance, and potentially to
other sub-fields of Economics as well. We develop a generic method to estimate
the Kolmogorov complexity of numeric series. This approach is based on an
iterative "regularity erasing procedure" implemented to use lossless
compression algorithms on financial data. Examples are provided with both
simulated and real-world financial time series. The contributions of this
article are twofold. The first one is methodological: we show that some
structural regularities, invisible with classical statistical tests, can be
detected by this algorithmic method. The second one consists in illustrations
on the daily Dow-Jones Index suggesting that, beyond several well-known
regularities, hidden structure may remain to be identified in this index.
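The general idea of using lossless compression as a computable stand-in for Kolmogorov complexity can be sketched as follows. This is only an illustration under stated assumptions: it uses Python's standard zlib compressor, the helper names and the up/down byte encoding are hypothetical, and it does not implement the paper's iterative "regularity erasing procedure". The compressed size gives only an upper-bound style proxy, since Kolmogorov complexity itself is not computable.

```python
import random
import zlib


def compression_ratio(data: bytes, level: int = 9) -> float:
    """Compressed size divided by original size (lower suggests more regularity)."""
    if not data:
        raise ValueError("data must be non-empty")
    return len(zlib.compress(data, level)) / len(data)


def encode_returns(prices):
    """Hypothetical encoding: one byte per step, 1 if the price rose, else 0."""
    return bytes(1 if b > a else 0 for a, b in zip(prices, prices[1:]))


# Two synthetic series for comparison (not real market data).
rng = random.Random(0)
periodic = bytes([0, 1] * 5000)                          # strictly alternating
noisy = bytes(rng.getrandbits(8) for _ in range(10000))  # pseudo-random bytes

print(compression_ratio(periodic), compression_ratio(noisy))
```

A strongly patterned series compresses to a small fraction of its size, while a pseudo-random one barely compresses at all; a series whose ratio sits well below that of a shuffled copy of itself would be a candidate for containing structure.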
Entropy of Some Models of Sparse Random Graphs With Vertex-Names
Consider the setting of sparse graphs on N vertices, where the vertices have
distinct "names", which are strings of length O(log N) from a fixed finite
alphabet. For many natural probability models, the entropy grows as cN log N
for some model-dependent rate constant c. The mathematical content of this
paper is the (often easy) calculation of c for a variety of models, in
particular for various standard random graph models adapted to this setting.
Our broader purpose is to publicize this particular setting as a natural
setting for future theoretical study of data compression for graphs, and (more
speculatively) for discussion of unorganized versus organized complexity.
Comment: 31 pages
Applications of Machine Learning to Threat Intelligence, Intrusion Detection and Malware
Artificial Intelligence (AI) and Machine Learning (ML) are emerging technologies with applications to many fields. This paper is a survey of use cases of ML for threat intelligence, intrusion detection, and malware analysis and detection. Threat intelligence, especially attack attribution, can benefit from the use of ML classification. False positives from rule-based intrusion detection systems can be reduced with the use of ML models. Malware analysis and classification can be made easier by developing ML frameworks to distill similarities between the malicious programs. Adversarial machine learning will also be discussed, because while ML can be used to solve problems or reduce analyst workload, it also introduces new attack surfaces
Sharp error terms for return time statistics under mixing conditions
We describe the statistics of repetition times of a string of symbols in a
stochastic process. Denote by T(A) the time elapsed until the process spells
the finite string A and by S(A) the number of consecutive repetitions of A. We
prove that, if the length of the string grows unboundedly, (1) the distribution
of T(A), when the process starts with A, is well approximated by a certain
mixture of the point measure at the origin and an exponential law, and (2) S(A)
is approximately geometrically distributed. We provide sharp error terms for
each of these approximations. The errors we obtain are point-wise and also
yield approximations for all the moments of T(A) and S(A). To obtain (1) we
assume that the process is phi-mixing, while to obtain (2) we assume the
convergence of certain conditional probabilities.
The law of series
We prove a general ergodic-theoretic result concerning the return time
statistic, which, properly understood, sheds some new light on the common sense
phenomenon known as the law of series. Consider an ergodic process on
finitely many states, with positive entropy. We show that the distribution
function of the normalized waiting time for the first visit to a small cylinder
set is, for the majority of such cylinders and up to epsilon, dominated by the
exponential distribution function $1-e^{-t}$. This fact has the following
interpretation: the occurrences of such a "rare event" can deviate from
purely random in only one direction, so that for any length of an
"observation period" of time, the first occurrence of the event "attracts" its
further repetitions in this period.
Justifying additive-noise-model based causal discovery via algorithmic information theory
A recent method for causal discovery is in many cases able to infer whether X
causes Y or Y causes X for just two observed variables X and Y. It is based on
the observation that there exist (non-Gaussian) joint distributions P(X,Y) for
which Y may be written as a function of X up to an additive noise term that is
independent of X and no such model exists from Y to X. Whenever this is the
case, one prefers the causal model X --> Y.
Here we justify this method by showing that the causal hypothesis Y --> X is
unlikely because it requires a specific tuning between P(Y) and P(X|Y) to
generate a distribution that admits an additive noise model from X to Y. To
quantify the amount of tuning required we derive lower bounds on the
algorithmic information shared by P(Y) and P(X|Y). This way, our justification
is consistent with recent approaches for using algorithmic information theory
for causal reasoning. We extend this principle to the case where P(X,Y) almost
admits an additive noise model.
Our results suggest that the above conclusion is more reliable if the
complexity of P(Y) is high.
Comment: 17 pages, 1 figure