Robust Kullback-Leibler Divergence and Its Applications in Universal Hypothesis Testing and Deviation Detection
The Kullback-Leibler (KL) divergence is one of the most fundamental metrics in information theory and statistics and provides various operational interpretations in the context of mathematical communication theory and statistical hypothesis testing. The KL divergence for discrete distributions has the desired continuity property which leads to some fundamental results in universal hypothesis testing. With continuous observations, however, the KL divergence is only lower semi-continuous; difficulties arise when tackling universal hypothesis testing with continuous observations due to the lack of continuity in KL divergence.
This dissertation proposes a robust version of the KL divergence for continuous alphabets. Specifically, the KL divergence defined from a distribution to the Levy ball centered at the other distribution is found to be continuous. This robust version of the KL divergence allows one to generalize the result in universal hypothesis testing for discrete alphabets to that for continuous observations. An optimal decision rule is developed for universal hypothesis testing, and its robustness is provably established.
Another application of the robust KL divergence is in deviation detection: the problem of detecting deviation from a nominal distribution using a sequence of independent and identically distributed observations. An asymptotically -optimal detector is then developed for deviation detection where the Levy metric becomes a very natural distance measure for deviation from the nominal distribution.
Lastly, the dissertation considers the following variation of a distributed detection problem: a sensor may overhear other sensors' transmissions and thus may choose to refine its output in the hope of achieving a better detection performance. While this is shown to be possible for the fixed sample size test, asymptotically (in the number of samples) there is no performance gain, as measured by the KL divergence achievable at the fusion center, provided that the observations are conditionally independent. For conditionally dependent observations, however, asymptotic detection performance may indeed be improved when overhearing is utilized.
Feature Extraction for Universal Hypothesis Testing via Rank-constrained Optimization
This paper concerns the construction of tests for universal hypothesis
testing problems, in which the alternate hypothesis is poorly modeled and the
observation space is large. The mismatched universal test is a feature-based
technique for this purpose. In prior work it is shown that its
finite-observation performance can be much better than the (optimal) Hoeffding
test, and good performance depends crucially on the choice of features. The
contributions of this paper include: 1) We obtain bounds on the number of
\epsilon distinguishable distributions in an exponential family. 2) This
motivates a new framework for feature extraction, cast as a rank-constrained
optimization problem. 3) We obtain a gradient-based algorithm to solve the
rank-constrained optimization problem and prove its local convergence.
Comment: 5 pages, 4 figures, submitted to ISIT 201
Universal and Composite Hypothesis Testing via Mismatched Divergence
For the universal hypothesis testing problem, where the goal is to decide
between the known null hypothesis distribution and some other unknown
distribution, Hoeffding proposed a universal test in the nineteen sixties.
Hoeffding's universal test statistic can be written in terms of
Kullback-Leibler (K-L) divergence between the empirical distribution of the
observations and the null hypothesis distribution. In this paper a modification
of Hoeffding's test is considered based on a relaxation of the K-L divergence
test statistic, referred to as the mismatched divergence. The resulting
mismatched test is shown to be a generalized likelihood-ratio test (GLRT) for
the case where the alternate distribution lies in a parametric family of
distributions characterized by a finite-dimensional parameter, i.e., it is a
solution to the corresponding composite hypothesis testing problem. For certain
choices of the alternate distribution, it is shown that both the Hoeffding test
and the mismatched test have the same asymptotic performance in terms of error
exponents. A consequence of this result is that the GLRT is optimal in
differentiating a particular distribution from others in an exponential family.
It is also shown that the mismatched test has a significant advantage over the
Hoeffding test in terms of finite sample size performance. This advantage is
due to the difference in the asymptotic variances of the two test statistics
under the null hypothesis. In particular, the variance of the K-L divergence
grows linearly with the alphabet size, making the test impractical for
applications involving large alphabet distributions. The variance of the
mismatched divergence on the other hand grows linearly with the dimension of
the parameter space, and can hence be controlled through a prudent choice of
the function class defining the mismatched divergence.
Comment: Accepted to IEEE Transactions on Information Theory, July 201
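The Hoeffding statistic described in this abstract, the K-L divergence between the empirical distribution and the null distribution, can be sketched for a finite alphabet as follows. This is a minimal illustration, not the authors' code: the function names, the dictionary representation of distributions, and the fixed `threshold` argument are all assumptions made for the example.

```python
import math
from collections import Counter


def kl_divergence(p, q):
    """Return D(p || q) in nats for distributions given as dicts over a finite alphabet."""
    total = 0.0
    for symbol, p_x in p.items():
        if p_x == 0.0:
            continue  # terms with p(x) = 0 contribute nothing
        q_x = q.get(symbol, 0.0)
        if q_x == 0.0:
            return math.inf  # p puts mass where q has none
        total += p_x * math.log(p_x / q_x)
    return total


def empirical_distribution(samples):
    """Return the relative frequency of each observed symbol."""
    counts = Counter(samples)
    n = len(samples)
    return {symbol: count / n for symbol, count in counts.items()}


def hoeffding_test(samples, null_dist, threshold):
    """Reject the null when D(empirical || null) meets or exceeds the threshold.

    The threshold is a hypothetical tuning parameter chosen by the user.
    """
    statistic = kl_divergence(empirical_distribution(samples), null_dist)
    return statistic, statistic >= threshold
```

For example, `hoeffding_test(list("aaaa"), {"a": 0.5, "b": 0.5}, threshold=0.1)` compares four observations of `a` against a uniform null on two symbols. The mismatched test discussed in the abstract replaces this statistic with a relaxation over a function class, which this sketch does not attempt.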
Statistical inference optimized with respect to the observed sample for single or multiple comparisons
The normalized maximum likelihood (NML) is a recent penalized likelihood that
has properties that justify defining the amount of discrimination information
(DI) in the data supporting an alternative hypothesis over a null hypothesis as
the logarithm of an NML ratio, namely, the alternative hypothesis NML divided
by the null hypothesis NML. The resulting DI, like the Bayes factor but unlike
the p-value, measures the strength of evidence for an alternative hypothesis
over a null hypothesis such that the probability of misleading evidence
vanishes asymptotically under weak regularity conditions and such that evidence
can support a simple null hypothesis. Unlike the Bayes factor, the DI does not
require a prior distribution and is minimax optimal in a sense that does not
involve averaging over outcomes that did not occur. Replacing a (possibly
pseudo-) likelihood function with its weighted counterpart extends the scope of
the DI to models for which the unweighted NML is undefined. The likelihood
weights leverage side information, either in data associated with comparisons
other than the comparison at hand or in the parameter value of a simple null
hypothesis. Two case studies, one involving multiple populations and the other
involving multiple biological features, indicate that the DI is robust to the
type of side information used when that information is assigned the weight of a
single observation. Such robustness suggests that very little adjustment for
multiple comparisons is warranted if the sample size is at least moderate.
Comment: Typo in equation (7) of v2 corrected in equation (6) of v3; clarity
improved
The Sample Complexity of Search over Multiple Populations
This paper studies the sample complexity of searching over multiple
populations. We consider a large number of populations, each corresponding to
either distribution P0 or P1. The goal of the search problem studied here is to
find one population corresponding to distribution P1 with as few samples as
possible. The main contribution is to quantify the number of samples needed to
correctly find one such population. We consider two general approaches:
non-adaptive sampling methods, which sample each population a predetermined
number of times until a population following P1 is found, and adaptive sampling
methods, which employ sequential sampling schemes for each population. We first
derive a lower bound on the number of samples required by any sampling scheme.
We then consider an adaptive procedure consisting of a series of sequential
probability ratio tests, and show it comes within a constant factor of the
lower bound. We give explicit expressions for this constant when samples of the
populations follow Gaussian and Bernoulli distributions. An alternative
adaptive scheme is discussed which does not require full knowledge of P1, and
comes within a constant factor of the optimal scheme. For comparison, a lower
bound on the sampling requirements of any non-adaptive scheme is presented.
Comment: To appear, IEEE Transactions on Information Theory
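A single sequential probability ratio test, the building block of the adaptive procedure mentioned in this abstract, can be sketched for Bernoulli observations as below. This is only an illustration under stated assumptions: the function name, the default error targets `alpha` and `beta`, and the use of Wald's approximate thresholds are choices made here, and the paper's full procedure (a series of such tests across populations) is not reproduced.

```python
import math


def sprt_bernoulli(samples, p0, p1, alpha=0.05, beta=0.05):
    """Run one SPRT deciding between Bernoulli(p0) and Bernoulli(p1).

    Uses Wald's approximate thresholds (an assumption of this sketch):
    accept P1 when the log-likelihood ratio reaches log((1 - beta) / alpha),
    accept P0 when it falls to log(beta / (1 - alpha)).
    Returns the decision and the number of samples consumed.
    """
    upper = math.log((1 - beta) / alpha)
    lower = math.log(beta / (1 - alpha))
    llr = 0.0
    used = 0
    for used, x in enumerate(samples, start=1):
        # Add the log-likelihood ratio of the new observation.
        if x:
            llr += math.log(p1 / p0)
        else:
            llr += math.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "P1", used
        if llr <= lower:
            return "P0", used
    return "undecided", used
```

A search over populations could call `sprt_bernoulli` on one population at a time and move on whenever the result is `"P0"`; that outer loop is left out here because its details are specific to the paper.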
The cost of information
We develop an axiomatic theory of information acquisition that captures the
idea of constant marginal costs in information production: the cost of
generating two independent signals is the sum of their costs, and generating a
signal with probability half costs half its original cost. Together with
monotonicity and continuity conditions, these axioms determine the cost of a
signal up to a vector of parameters. These parameters have a clear economic
interpretation and determine the difficulty of distinguishing states. We argue
that this cost function is a versatile modeling tool that leads to more
realistic predictions than mutual information.
Comment: 52 pages, 4 figures
Revisiting Chernoff Information with Likelihood Ratio Exponential Families
The Chernoff information between two probability measures is a statistical
divergence measuring their deviation defined as their maximally skewed
Bhattacharyya distance. Although the Chernoff information was originally
introduced for bounding the Bayes error in statistical hypothesis testing, the
divergence has found many other uses owing to its empirical robustness, in
areas ranging from information fusion to quantum
information. From the viewpoint of information theory, the Chernoff information
can also be interpreted as a minmax symmetrization of the Kullback--Leibler
divergence. In this paper, we first revisit the Chernoff information between
two densities of a measurable Lebesgue space by considering the exponential
families induced by their geometric mixtures: The so-called likelihood ratio
exponential families. Second, we show how to (i) solve exactly the Chernoff
information between any two univariate Gaussian distributions or get a
closed-form formula using symbolic computing, (ii) report a closed-form formula
of the Chernoff information of centered Gaussians with scaled covariance
matrices and (iii) use a fast numerical scheme to approximate the Chernoff
information between any two multivariate Gaussian distributions.
Comment: 41 pages
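The definition used in this abstract, Chernoff information as the maximally skewed Bhattacharyya distance, can be illustrated numerically for two univariate Gaussians. The sketch below is not the paper's exact or symbolic method: it uses the standard closed form of the skewed Bhattacharyya distance between Gaussians and a plain ternary search over the skewing parameter, and the function names and iteration count are assumptions of the example.

```python
import math


def skewed_bhattacharyya(alpha, m1, s1, m2, s2):
    """Return -log of the integral of p^alpha * q^(1 - alpha).

    p = N(m1, s1^2) and q = N(m2, s2^2), with standard deviations s1, s2 > 0.
    """
    a = alpha / s1 ** 2
    b = (1.0 - alpha) / s2 ** 2
    return (
        alpha * math.log(s1)
        + (1.0 - alpha) * math.log(s2)
        + 0.5 * math.log(a + b)
        + 0.5 * (a * b / (a + b)) * (m1 - m2) ** 2
    )


def chernoff_information(m1, s1, m2, s2, iterations=200):
    """Maximize the skewed Bhattacharyya distance over alpha in (0, 1).

    The distance is concave in alpha, so a ternary search suffices here.
    Returns the approximate Chernoff information and the maximizing alpha.
    """
    lo, hi = 0.0, 1.0
    for _ in range(iterations):
        left = lo + (hi - lo) / 3.0
        right = hi - (hi - lo) / 3.0
        if skewed_bhattacharyya(left, m1, s1, m2, s2) < skewed_bhattacharyya(
            right, m1, s1, m2, s2
        ):
            lo = left
        else:
            hi = right
    alpha = 0.5 * (lo + hi)
    return skewed_bhattacharyya(alpha, m1, s1, m2, s2), alpha
```

For equal variances the maximum sits at alpha = 1/2 and the value reduces to (m1 - m2)^2 / (8 s^2), which gives a quick sanity check on the search.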