127,527 research outputs found

    Mutual information and conditional mean prediction error

    This version: arXiv:1407.7165v1. Available from arXiv.org via the link in this record.

    Mutual information is fundamentally important for measuring statistical dependence between variables and for quantifying information transfer by signaling and communication mechanisms. It can, however, be challenging to evaluate for physical models of such mechanisms and to estimate reliably from data. Furthermore, its relationship to better known statistical procedures is still poorly understood. Here we explore new connections between mutual information and regression-based dependence measures, $\nu^{-1}$, that utilise the determinant of the second-moment matrix of the conditional mean prediction error. We examine convergence properties as $\nu \rightarrow 0$ and establish sharp lower bounds on mutual information and capacity of the form $\log(\nu^{-1/2})$. The bounds are tighter than lower bounds based on the Pearson correlation and ones derived using average mean square-error rate distortion arguments. Furthermore, their estimation is feasible using techniques from nonparametric regression. As an illustration we provide bootstrap confidence intervals for the lower bounds which, through use of a composite estimator, substantially improve upon inference about mutual information based on $k$-nearest neighbour estimators alone.
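To see why a bound of this form can be sharp, consider the scalar jointly Gaussian case. The worked equation below is an illustrative sketch, not taken from the paper: it assumes $\nu$ is the conditional mean prediction error variance normalised by the variance of $Y$ (the abstract does not spell out the normalisation), with $\rho$ the Pearson correlation of $X$ and $Y$.

```latex
% Assumption: (X, Y) bivariate Gaussian with correlation \rho, and
% \nu taken as the normalised prediction error variance (scalar case).
\nu \;=\; \frac{\mathbb{E}\!\left[(Y - \mathbb{E}[Y \mid X])^{2}\right]}{\operatorname{Var}(Y)}
    \;=\; 1 - \rho^{2},
\qquad
I(X;Y) \;=\; -\tfrac{1}{2}\log\!\left(1 - \rho^{2}\right)
       \;=\; \log\!\left(\nu^{-1/2}\right).
```

Under these assumptions the lower bound is attained with equality, which is consistent with the abstract's description of the bounds as sharp.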

    LeakWatch: Estimating Information Leakage from Java Programs

    Programs that process secret data may inadvertently reveal information about those secrets in their publicly-observable output. This paper presents LeakWatch, a quantitative information leakage analysis tool for the Java programming language; it is based on a flexible "point-to-point" information leakage model, where secret and publicly-observable data may occur at any time during a program's execution. LeakWatch repeatedly executes a Java program containing both secret and publicly-observable data and uses robust statistical techniques to provide estimates, with confidence intervals, for min-entropy leakage (using a new theoretical result presented in this paper) and mutual information. We demonstrate how LeakWatch can be used to estimate the size of information leaks in a range of real-world Java programs.

    Evaluation of diversity, specialization, and gene specificity in transcriptomes

    The transcriptome is a set of genes transcribed in a given tissue under specific conditions and can be characterized by a list of genes with their corresponding frequencies of transcription. Transcriptome changes can be measured by counting gene tags from mRNA libraries or by measuring light signals in DNA microarrays. Recently we proposed an approach to define and estimate the diversity and specialization of transcriptomes and gene specificity. This approach can be useful for the determination and measurement of transcriptional networks. We defined transcriptome diversity as the Shannon entropy of its frequency distribution. Gene specificity is defined as the mutual information between the tissues and the corresponding transcript, allowing detection of either housekeeping or highly specific genes and clarifying the meaning of these concepts in the literature. Tissue specialization is measured by average gene specificity. Visualizing the positions of transcriptomes in a system of diversity and specialization coordinates makes it possible to understand their interrelations at a glance, summarizing which transcriptomes are richer in diversity of expressed genes and which are relatively more specialized. This clarifies the relations among transcriptomes, allowing a better understanding of their changes through the development of the organism or in response to environmental stimuli. We present statistical tools based on resampling procedures to obtain confidence intervals for the parameters and to perform statistical tests. These approaches are illustrated with a human dataset.
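The two definitions in this abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: the function names are hypothetical, tissues are assumed to be equally weighted, logarithms are taken base 2, and gene specificity is read as the per-gene contribution to the mutual information between tissue and transcript.

```python
import math


def shannon_entropy(freqs):
    """Transcriptome diversity: Shannon entropy (bits) of the gene
    frequency distribution of one tissue. `freqs` should sum to 1."""
    return -sum(p * math.log2(p) for p in freqs if p > 0)


def gene_specificity(p_by_tissue):
    """Specificity of one gene (bits), assuming equally weighted tissues.

    `p_by_tissue` holds the gene's relative frequency in each tissue.
    The value is 0 for a gene expressed equally everywhere and
    log2(number of tissues) for a gene expressed in a single tissue.
    """
    t = len(p_by_tissue)
    mean = sum(p_by_tissue) / t  # average frequency across tissues
    if mean == 0:
        return 0.0  # gene never observed: no information
    return sum((p / mean) * math.log2(p / mean)
               for p in p_by_tissue if p > 0) / t


if __name__ == "__main__":
    print(shannon_entropy([0.25, 0.25, 0.25, 0.25]))  # uniform over 4 genes
    print(gene_specificity([0.1, 0.1, 0.1, 0.1]))     # housekeeping-like gene
    print(gene_specificity([0.2, 0.0, 0.0, 0.0]))     # tissue-specific gene
```

Under the same assumptions, tissue specialization would be the average of `gene_specificity` over genes, weighted by each gene's frequency in that tissue.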

    Mutual Information Input Selector and Probabilistic Machine Learning Utilisation for Air Pollution Proxies

    An air pollutant proxy is a mathematical model that estimates an unobserved air pollutant using other measured variables. The proxy is advantageous to fill missing data in a research campaign or to substitute a real measurement for minimising the cost as well as the operators involved (i.e., virtual sensor). In this paper, we present a generic concept of pollutant proxy development based on an optimised data-driven approach. We propose a mutual information concept to determine the interdependence of different variables and thus select the most correlated inputs. The most relevant variables are selected to be the best proxy inputs, where several metrics and data loss are also involved for guidance. The input selection method determines the data used for training pollutant proxies based on a probabilistic machine learning method. In particular, we use a Bayesian neural network that naturally prevents overfitting and provides confidence intervals around its output prediction. In this way, the prediction uncertainty can be assessed and evaluated. In order to demonstrate the effectiveness of our approach, we test it on an extensive air pollution database to estimate ozone concentration. Peer reviewed.
