    Rare Events and Conditional Events on Random Strings

    Some strings (the texts) are assumed to be randomly generated according to a probability model that is either a Bernoulli model or a Markov model. A rare event is the over- or under-representation of a word or a set of words. The aim of this paper is twofold. First, a single word is given and the tail distribution of its number of occurrences is studied; sharp large deviation estimates are derived. Second, a given word is assumed to be overrepresented and the distribution of a second word is studied; formulae for the expectation and the variance are derived. In both cases, the formulae are accurate and actually computable. These results have applications in computational biology, where a genome is viewed as a text.

    Causality - Complexity - Consistency: Can Space-Time Be Based on Logic and Computation?

    The difficulty of explaining non-local correlations in a fixed causal structure sheds new light on the old debate on whether space and time are to be seen as fundamental. Refraining from assuming space-time as given a priori has a number of consequences. First, the usual definitions of randomness depend on a causal structure and turn meaningless. So motivated, we propose an intrinsic, physically motivated measure for the randomness of a string of bits: its length minus its normalized work value, a quantity we closely relate to its Kolmogorov complexity (the length of the shortest program making a universal Turing machine output this string). We test this alternative concept of randomness for the example of non-local correlations, and we end up with a reasoning that leads to similar conclusions as in, but is conceptually more direct than, the probabilistic view since only the outcomes of measurements that can actually all be carried out together are put into relation to each other. In the same context-free spirit, we connect the logical reversibility of an evolution to the second law of thermodynamics and the arrow of time. Refining this, we end up with a speculation on the emergence of a space-time structure on bit strings in terms of data-compressibility relations. Finally, we show that logical consistency, by which we replace the abandoned causality, is strictly a weaker constraint than the latter in the multi-party case. Comment: 17 pages, 16 figures, small correction

    On the Vocabulary of Grammar-Based Codes and the Logical Consistency of Texts

    The article presents a new interpretation for Zipf-Mandelbrot's law in natural language which rests on two areas of information theory. Firstly, we construct a new class of grammar-based codes and, secondly, we investigate properties of strongly nonergodic stationary processes. The motivation for the joint discussion is to prove a proposition with a simple informal statement: If a text of length $n$ describes $n^\beta$ independent facts in a repetitive way then the text contains at least $n^\beta/\log n$ different words, under suitable conditions on $n$. In the formal statement, two modeling postulates are adopted. Firstly, the words are understood as nonterminal symbols of the shortest grammar-based encoding of the text. Secondly, the text is assumed to be emitted by a finite-energy strongly nonergodic source whereas the facts are binary IID variables predictable in a shift-invariant way. Comment: 24 pages, no figures

    Estimating the Algorithmic Complexity of Stock Markets

    Randomness and regularities in Finance are usually treated in probabilistic terms. In this paper, we develop a completely different approach, using a non-probabilistic framework based on the algorithmic information theory initially developed by Kolmogorov (1965). We present some elements of this theory and show why it is particularly relevant to Finance, and potentially to other sub-fields of Economics as well. We develop a generic method to estimate the Kolmogorov complexity of numeric series. This approach is based on an iterative "regularity erasing procedure" implemented to use lossless compression algorithms on financial data. Examples are provided with both simulated and real-world financial time series. The contributions of this article are twofold. The first one is methodological: we show that some structural regularities, invisible with classical statistical tests, can be detected by this algorithmic method. The second one consists of illustrations on the daily Dow-Jones Index suggesting that, beyond several well-known regularities, hidden structure in this index may remain to be identified.
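
    As a rough illustration of the idea of estimating Kolmogorov complexity with a lossless compressor, the sketch below compares compression ratios of two simulated series. It is not the authors' "regularity erasing procedure": the up/down encoding, the use of `zlib`, and the names `compression_ratio` and `up_down_string` are all assumptions made for this example. A compressed size only gives an upper-bound proxy for complexity.

```python
import random
import zlib


def compression_ratio(symbols: str) -> float:
    """Compressed size divided by raw size: a crude upper-bound proxy for complexity."""
    raw = symbols.encode("ascii")
    return len(zlib.compress(raw, 9)) / len(raw)


def up_down_string(series):
    """Encode a numeric series as '1' (up move) or '0' (down or flat move)."""
    return "".join("1" if b > a else "0" for a, b in zip(series, series[1:]))


# Simulated data (hypothetical, for illustration only).
rng = random.Random(0)
random_walk = [0.0]
for _ in range(5000):
    random_walk.append(random_walk[-1] + rng.choice([-1.0, 1.0]))

# A strictly periodic series: its up/down pattern repeats every 4 steps.
periodic = [float(i % 4) for i in range(5001)]

ratio_random = compression_ratio(up_down_string(random_walk))
ratio_periodic = compression_ratio(up_down_string(periodic))

print(f"random walk ratio: {ratio_random:.3f}")
print(f"periodic ratio:    {ratio_periodic:.3f}")
```

    The periodic series compresses far better than the random walk, which is the kind of regularity a compression-based estimate is meant to expose.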

    Entropy of Some Models of Sparse Random Graphs With Vertex-Names

    Consider the setting of sparse graphs on N vertices, where the vertices have distinct "names", which are strings of length O(log N) from a fixed finite alphabet. For many natural probability models, the entropy grows as cN log N for some model-dependent rate constant c. The mathematical content of this paper is the (often easy) calculation of c for a variety of models, in particular for various standard random graph models adapted to this setting. Our broader purpose is to publicize this particular setting as a natural setting for future theoretical study of data compression for graphs, and (more speculatively) for discussion of unorganized versus organized complexity. Comment: 31 pages

    Applications of Machine Learning to Threat Intelligence, Intrusion Detection and Malware

    Artificial Intelligence (AI) and Machine Learning (ML) are emerging technologies with applications to many fields. This paper is a survey of use cases of ML for threat intelligence, intrusion detection, and malware analysis and detection. Threat intelligence, especially attack attribution, can benefit from the use of ML classification. False positives from rule-based intrusion detection systems can be reduced with the use of ML models. Malware analysis and classification can be made easier by developing ML frameworks to distill similarities between the malicious programs. Adversarial machine learning will also be discussed, because while ML can be used to solve problems or reduce analyst workload, it also introduces new attack surfaces.

    Sharp error terms for return time statistics under mixing conditions

    We describe the statistics of repetition times of a string of symbols in a stochastic process. Denote by T(A) the time elapsed until the process spells the finite string A and by S(A) the number of consecutive repetitions of A. We prove that, if the length of the string grows unboundedly, (1) the distribution of T(A), when the process starts with A, is well approximated by a certain mixture of the point measure at the origin and an exponential law, and (2) S(A) is approximately geometrically distributed. We provide sharp error terms for each of these approximations. The errors we obtain are point-wise and also yield approximations for all the moments of T(A) and S(A). To obtain (1) we assume that the process is phi-mixing, while to obtain (2) we assume the convergence of certain conditional probabilities.
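
    A small simulation can make the return-time picture concrete. The sketch below is only an illustration under assumptions not taken from the paper: the process is i.i.d. fair coin flips, the string A is the fixed word "0101" (so its length does not grow), and the helper name `occurrence_positions` is hypothetical. Because "0101" overlaps itself at lag 2, a noticeable share of returns are very short, which loosely corresponds to the point mass in the mixture, while the remaining returns are spread out over longer times.

```python
import random


def occurrence_positions(text: str, word: str):
    """Start indices of all (possibly overlapping) occurrences of word in text."""
    positions = []
    i = text.find(word)
    while i != -1:
        positions.append(i)
        i = text.find(word, i + 1)
    return positions


# Assumed model: independent fair bits (an illustrative choice, not the paper's setting).
rng = random.Random(1)
text = "".join(rng.choice("01") for _ in range(200_000))
word = "0101"

pos = occurrence_positions(text, word)
# Return times: gaps between successive occurrences of the word.
gaps = [b - a for a, b in zip(pos, pos[1:])]

mean_gap = sum(gaps) / len(gaps)
# Share of returns at the overlap lag 2 (the word repeating immediately).
short_fraction = sum(1 for g in gaps if g == 2) / len(gaps)

print(f"mean return time: {mean_gap:.2f} (1/P(A) = {2 ** len(word)})")
print(f"fraction of returns at lag 2: {short_fraction:.3f}")
```

    Under these assumptions the mean return time should be close to 1/P(A) = 16, and about a quarter of the returns should occur at lag 2, since an immediate repetition requires the next two bits to be "01".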

    The law of series

    We prove a general ergodic-theoretic result concerning the return time statistic, which, properly understood, sheds some new light on the common sense phenomenon known as the law of series. Consider an ergodic process on finitely many states, with positive entropy. We show that the distribution function of the normalized waiting time for the first visit to a small cylinder set $B$ is, for the majority of such cylinders and up to epsilon, dominated by the exponential distribution function $1-e^{-t}$. This fact has the following interpretation: The occurrences of such a "rare event" $B$ can deviate from purely random in only one direction, so that for any length of an "observation period" of time, the first occurrence of $B$ "attracts" its further repetitions in this period.

    Justifying additive-noise-model based causal discovery via algorithmic information theory

    A recent method for causal discovery is in many cases able to infer whether X causes Y or Y causes X for just two observed variables X and Y. It is based on the observation that there exist (non-Gaussian) joint distributions P(X,Y) for which Y may be written as a function of X up to an additive noise term that is independent of X, while no such model exists from Y to X. Whenever this is the case, one prefers the causal model X --> Y. Here we justify this method by showing that the causal hypothesis Y --> X is unlikely because it requires a specific tuning between P(Y) and P(X|Y) to generate a distribution that admits an additive noise model from X to Y. To quantify the amount of tuning required we derive lower bounds on the algorithmic information shared by P(Y) and P(X|Y). This way, our justification is consistent with recent approaches for using algorithmic information theory for causal reasoning. We extend this principle to the case where P(X,Y) almost admits an additive noise model. Our results suggest that the above conclusion is more reliable if the complexity of P(Y) is high. Comment: 17 pages, 1 figure
