Nonlinear Information Bottleneck
Information bottleneck (IB) is a technique for extracting the information in
one random variable X that is relevant for predicting another random variable
Y. IB works by encoding X in a compressed "bottleneck" random variable M from
which Y can be accurately decoded. However, finding the optimal bottleneck
variable involves a difficult optimization problem, which until recently has
been considered solvable for only two limited cases: discrete X and Y with
small state spaces, and continuous X and Y with a Gaussian joint distribution
(in which case the optimal encoding and decoding maps are linear). We propose a
method for performing IB on arbitrarily-distributed discrete and/or continuous
X and Y, while allowing for nonlinear encoding and decoding maps. Our approach
relies on a novel non-parametric upper bound for mutual information. We
describe how to implement our method using neural networks. We then show that
it achieves better performance than the recently-proposed "variational IB"
method on several real-world datasets.
Trading quantum for classical resources in quantum data compression
We study the visible compression of a source E of pure quantum signal states,
or, more formally, the minimal resources per signal required to represent
arbitrarily long strings of signals with arbitrarily high fidelity, when the
compressor is given the identity of the input state sequence as classical
information. According to the quantum source coding theorem, the optimal
quantum rate is the von Neumann entropy S(E) qubits per signal.
We develop a refinement of this theorem in order to analyze the situation in
which the states are coded into classical and quantum bits that are quantified
separately. This leads to a trade-off curve Q(R), where Q(R) qubits per signal
is the optimal quantum rate for a given classical rate of R bits per signal.
Our main result is an explicit characterization of this trade-off function
by a simple formula in terms of only single signal, perfect fidelity encodings
of the source. We give a thorough discussion of many further mathematical
properties of our formula, including an analysis of its behavior for group
covariant sources and a generalization to sources with continuously
parameterized states. We also show that our result leads to a number of
corollaries characterizing the trade-off between information gain and state
disturbance for quantum sources. In addition, we indicate how our techniques
also provide a solution to the so-called remote state preparation problem.
Finally, we develop a probability-free version of our main result which may be
interpreted as an answer to the question: "How many classical bits does a
qubit cost?" This theorem provides a type of dual to Holevo's theorem, insofar
as the latter characterizes the cost of coding classical bits into qubits.
Forgetfulness of continuous Markovian quantum channels
The notion of forgetfulness, used in discrete quantum memory channels, is
slightly weakened in order to be applied to the case of continuous channels.
This is done in the context of quantum memory channels with Markovian noise. As
a case study, we apply the notion of weak-forgetfulness to a bosonic memory
channel with additive noise. A suitable encoding and decoding unitary
transformation allows us to unravel the effects of the memory, hence the
channel capacities can be computed using known results from the memoryless
setting.
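For context on the "known results from the memoryless setting," here is a hedged numeric sketch of one such result, the capacity formula C = g(N + n) − g(n) for a memoryless bosonic channel with additive classical noise n under a mean photon-number constraint N, where g is the thermal-state entropy (the parameter values and the choice of channel are illustrative assumptions, not taken from the paper):

```python
import numpy as np

def g(x):
    """Entropy in bits of a thermal state with mean photon number x."""
    if x <= 0:
        return 0.0
    return (x + 1) * np.log2(x + 1) - x * np.log2(x)

def additive_noise_capacity(N, n):
    """Capacity in bits per channel use: C = g(N + n) - g(n)."""
    return g(N + n) - g(n)

cap = additive_noise_capacity(N=5.0, n=1.0)   # toy parameter values
```

As expected, the capacity reduces to g(N) when the added noise vanishes and decreases monotonically as the noise grows.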
Minimum Rates of Approximate Sufficient Statistics
Given a sufficient statistic for a parametric family of distributions, one
can estimate the parameter without access to the data. However, the memory or
code size for storing the sufficient statistic may nonetheless still be
prohibitive. Indeed, for n independent samples drawn from a k-nomial
distribution with d = k - 1 degrees of freedom, the length of the code scales
as d log n + O(1). In many applications, we may not have a useful notion of
sufficient statistics (e.g., when the parametric family is not an exponential
family) and we also may not need to reconstruct the generating distribution
exactly. By adopting a Shannon-theoretic approach in which we allow a small
error in estimating the generating distribution, we construct various
"approximate sufficient statistics" and show that the code length can be
reduced to (d/2) log n + O(1). We consider errors measured according to the
relative entropy and variational distance criteria. For the code constructions,
we leverage Rissanen's minimum description length principle, which yields a
non-vanishing error measured according to the relative entropy. For the
converse parts, we use Clarke and Barron's formula for the relative entropy of
a parametrized distribution and the corresponding mixture distribution.
However, this method only yields a weak converse for the variational distance.
We develop new techniques to achieve vanishing errors and we also prove strong
converses. The latter means that even if the code is allowed to have a
non-vanishing error, its length must still be at least (d/2) log n. (To appear
in the IEEE Transactions on Information Theory.)
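The rate gap above can be made concrete with a toy multinomial example: the counts vector is an exact sufficient statistic costing about d log n bits, while an approximate sufficient statistic needs only about (d/2) log n (the sampling setup below is an illustrative assumption):

```python
import math
import numpy as np

rng = np.random.default_rng(0)

k, n = 4, 10_000                 # k-nomial source, n i.i.d. samples
d = k - 1                        # degrees of freedom
theta = np.array([0.1, 0.2, 0.3, 0.4])

samples = rng.choice(k, size=n, p=theta)
counts = np.bincount(samples, minlength=k)   # exact sufficient statistic

# Storing d counts exactly (the last is determined by n): ~ d log n bits.
exact_bits = d * math.log2(n + 1)
# Rate achievable with an approximate sufficient statistic: ~ (d/2) log n bits.
approx_bits = 0.5 * d * math.log2(n)
```

Halving the coefficient of log n is a substantial saving when n is large, at the price of only approximately recovering the generating distribution.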
Learning Generative Models with Sinkhorn Divergences
The ability to compare two degenerate probability distributions (i.e. two
probability distributions supported on two distinct low-dimensional manifolds
living in a much higher-dimensional space) is a crucial problem arising in the
estimation of generative models for high-dimensional observations such as those
arising in computer vision or natural language. It is known that optimal
transport metrics can represent a cure for this problem, since they were
specifically designed as an alternative to information divergences to handle
such problematic scenarios. Unfortunately, training generative machines using
OT raises formidable computational and statistical challenges, because of (i)
the computational burden of evaluating OT losses, (ii) the instability and lack
of smoothness of these losses, and (iii) the difficulty of robustly estimating
these losses and their gradients in high dimension. This paper presents the first
tractable computational method to train large scale generative models using an
optimal transport loss, and tackles these three issues by relying on two key
ideas: (a) entropic smoothing, which turns the original OT loss into one that
can be computed using Sinkhorn fixed point iterations; (b) algorithmic
(automatic) differentiation of these iterations. These two approximations
result in a robust and differentiable approximation of the OT loss with
streamlined GPU execution. Entropic smoothing generates a family of losses
interpolating between Wasserstein (OT) and Maximum Mean Discrepancy (MMD), thus
allowing one to find a sweet spot that leverages the geometry of OT and the
favorable high-dimensional sample complexity of MMD, which comes with unbiased
gradient estimates. The resulting computational architecture nicely complements
standard deep network generative models with a stack of extra layers
implementing the loss function.
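A minimal NumPy sketch of idea (a): the plain entropic OT cost computed with Sinkhorn fixed-point iterations. This is the forward pass only, without the automatic differentiation of idea (b) and without the debiasing used in the full Sinkhorn divergence; the point clouds and parameters are illustrative assumptions:

```python
import numpy as np

def sinkhorn_loss(x, y, eps=0.5, n_iters=200):
    """Entropy-regularized OT cost between uniform empirical measures
    supported on point clouds x (n, d) and y (m, d)."""
    C = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)   # squared-distance cost
    K = np.exp(-C / eps)                                 # Gibbs kernel
    a = np.full(len(x), 1.0 / len(x))
    b = np.full(len(y), 1.0 / len(y))
    u = np.ones_like(a)
    for _ in range(n_iters):          # Sinkhorn fixed-point iterations
        v = b / (K.T @ u)
        u = a / (K @ v)
    P = u[:, None] * K * v[None, :]   # approximate optimal transport plan
    return float((P * C).sum())

rng = np.random.default_rng(0)
x = rng.normal(size=(50, 2))
y = rng.normal(loc=3.0, size=(60, 2))
loss = sinkhorn_loss(x, y)
```

Larger eps pushes the loss toward the MMD-like regime described in the abstract; smaller eps approaches the unregularized Wasserstein cost but, in practice, requires log-domain stabilization of the iterations to avoid numerical underflow in the kernel.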