Comparison of Channels: Criteria for Domination by a Symmetric Channel
This paper studies the basic question of whether a given channel can be
dominated (in the precise sense of being more noisy) by a q-ary symmetric
channel. The concept of "less noisy" relation between channels originated in
network information theory (broadcast channels) and is defined in terms of
mutual information or Kullback-Leibler divergence. We provide an equivalent
characterization in terms of χ²-divergence. Furthermore, we develop a
simple criterion for domination by a q-ary symmetric channel in terms of the
minimum entry of the stochastic matrix defining the channel. The criterion
is strengthened for the special case of additive noise channels over finite
Abelian groups. Finally, it is shown that domination by a symmetric channel
implies (via comparison of Dirichlet forms) a logarithmic Sobolev inequality
for the original channel.
Comment: 31 pages, 2 figures. Presented at 2017 IEEE International Symposium
on Information Theory (ISIT
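The two objects this abstract refers to can be illustrated with a short sketch. It builds the transition matrix of a q-ary symmetric channel and reads off the minimum entry of a stochastic matrix. The parametrization (total crossover probability `delta` spread evenly over the q - 1 incorrect symbols) is the usual convention and is an assumption here, since the abstract does not state it. The function names are hypothetical. The sketch does not implement the paper's domination criterion, which the abstract does not spell out.

```python
def q_ary_symmetric_channel(q, delta):
    """Row-stochastic matrix W with W[x][x] = 1 - delta and
    W[x][y] = delta / (q - 1) for y != x (assumed parametrization)."""
    if q < 2 or not 0.0 <= delta <= 1.0:
        raise ValueError("need q >= 2 and 0 <= delta <= 1")
    off_diagonal = delta / (q - 1)
    return [
        [1.0 - delta if x == y else off_diagonal for y in range(q)]
        for x in range(q)
    ]


def is_row_stochastic(channel, tol=1e-12):
    """Check that every row is a probability vector."""
    return all(
        min(row) >= 0.0 and abs(sum(row) - 1.0) <= tol for row in channel
    )


def min_entry(channel):
    """Smallest transition probability of the channel matrix."""
    return min(min(row) for row in channel)
```

For example, `min_entry(q_ary_symmetric_channel(3, 0.3))` evaluates the smallest transition probability of a ternary symmetric channel.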
Broadcasting on Random Directed Acyclic Graphs
We study a generalization of the well-known model of broadcasting on trees.
Consider a directed acyclic graph (DAG) with a unique source vertex X, and
suppose all other vertices have indegree d ≥ 2. Let the vertices at
distance k from X be called layer k. At layer 0, X is given a random
bit. At layer k ≥ 1, each vertex receives d bits from its parents in
layer k - 1, which are transmitted along independent binary symmetric channel
edges, and combines them using a d-ary Boolean processing function. The goal
is to reconstruct X with probability of error bounded away from 1/2 using
the values of all vertices at an arbitrarily deep layer. This question is
closely related to models of reliable computation and storage, and information
flow in biological networks.
In this paper, we analyze randomly constructed DAGs, for which we show that
broadcasting is only possible if the noise level is below a certain degree and
function dependent critical threshold. For d ≥ 3, and random DAGs with
layer sizes Ω(log k) and majority processing functions, we identify the
critical threshold. For d = 2, we establish a similar result for NAND
processing functions. We also prove a partial converse for odd d ≥ 3
illustrating that the identified thresholds are impossible to improve by
selecting different processing functions if the decoder is restricted to using
a single vertex.
Finally, for any noise level, we construct explicit DAGs (using expander
graphs) with bounded degree and layer sizes Θ(log k) admitting
reconstruction. In particular, we show that such DAGs can be generated in
deterministic quasi-polynomial time or randomized polylogarithmic time in the
depth. These results portray a doubly-exponential advantage for storing a bit
in DAGs compared to trees, where d = 1 but layer sizes must grow exponentially
with depth in order to enable broadcasting.
Comment: 33 pages, double column format. arXiv admin note: text overlap with
arXiv:1803.0752
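The broadcasting model described in this abstract can be simulated with a rough Monte Carlo sketch. Several choices are assumptions made only for illustration and are not taken from the paper: every layer has the same fixed size, each vertex picks its parents uniformly at random with replacement from the previous layer, and the decoder is a majority vote over the final layer. All function names are hypothetical.

```python
import random


def bsc(bit, delta, rng):
    """Pass one bit through a binary symmetric channel with flip probability delta."""
    return bit ^ int(rng.random() < delta)


def majority(bits):
    """Majority vote; ties (possible only for even lengths) resolve to 0."""
    return int(2 * sum(bits) > len(bits))


def broadcast_random_dag(depth, layer_size, d, delta, rng):
    """Propagate a random source bit through `depth` random layers.

    Assumption: each vertex draws d parents uniformly with replacement
    from the previous layer and applies majority processing.
    """
    source = rng.randint(0, 1)
    layer = [source]
    for _ in range(depth):
        layer = [
            majority([bsc(rng.choice(layer), delta, rng) for _ in range(d)])
            for _ in range(layer_size)
        ]
    return source, layer


def estimate_error(trials, depth, layer_size, d, delta, seed=0):
    """Empirical error rate of a majority-vote decoder on the last layer."""
    rng = random.Random(seed)
    errors = 0
    for _ in range(trials):
        source, layer = broadcast_random_dag(depth, layer_size, d, delta, rng)
        errors += int(majority(layer) != source)
    return errors / trials
```

Calling `estimate_error` with an odd `d` and an odd `layer_size` for several values of `delta` gives a qualitative picture of how the error rate depends on the noise level. It is not a substitute for the threshold analysis in the paper.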
Probabilistic Clustering Using Maximal Matrix Norm Couplings
In this paper, we present a local information theoretic approach to
explicitly learn probabilistic clustering of a discrete random variable. Our
formulation yields a convex maximization problem for which it is NP-hard to
find the global optimum. In order to algorithmically solve this optimization
problem, we propose two relaxations that are solved via gradient ascent and
alternating maximization. Experiments on the MSR Sentence Completion Challenge,
MovieLens 100K, and Reuters21578 datasets demonstrate that our approach is
competitive with existing techniques and worthy of further investigation.
Comment: Presented at 56th Annual Allerton Conference on Communication,
Control, and Computing, 201
Broadcasting on Two-Dimensional Regular Grids
We study a specialization of the problem of broadcasting on directed acyclic
graphs, namely, broadcasting on 2D regular grids. Consider a 2D regular grid
with source vertex X at layer 0 and k + 1 vertices at layer k ≥ 1,
which are at distance k from X. Every vertex of the 2D regular grid has
outdegree 2, the vertices at the boundary have indegree 1, and all other
vertices have indegree 2. At time 0, X is given a random bit. At time
k ≥ 1, each vertex in layer k receives transmitted bits from its parents
in layer k - 1, where the bits pass through binary symmetric channels with
noise level δ ∈ (0, 1/2). Then, each vertex combines its received bits
using a common Boolean processing function to produce an output bit. The
objective is to recover X with probability of error better than 1/2 from
all vertices at layer k as k → ∞. Besides their natural
interpretation in communication networks, such broadcasting processes can be
construed as 1D probabilistic cellular automata (PCA) with boundary conditions
that limit the number of sites at each time k to k + 1. We conjecture that it
is impossible to propagate information in a 2D regular grid regardless of the
noise level and the choice of processing function. In this paper, we make
progress towards establishing this conjecture, and prove using ideas from
percolation and coding theory that recovery of X is impossible for any
δ ∈ (0, 1/2) provided that all vertices use either AND or XOR processing functions.
Furthermore, we propose a martingale-based approach that establishes the
impossibility of recovering X for any δ ∈ (0, 1/2) when all NAND processing
functions are used if certain supermartingales can be rigorously constructed.
We also provide numerical evidence for the existence of these supermartingales
by computing explicit examples for different values of δ via linear
programming.
Comment: 52 pages, 2 figure
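The grid process described in this abstract can be written down as a minimal simulation sketch. The indexing is an assumption: vertex j of a layer receives noisy bits from vertices j - 1 and j of the previous layer, whenever those exist. A second assumption is that a boundary vertex, which has a single parent, simply outputs its one received bit. Function names are hypothetical.

```python
import random


def noisy(bit, delta, rng):
    """Binary symmetric channel: flip the bit with probability delta."""
    return bit ^ int(rng.random() < delta)


def next_layer(prev, delta, gate, rng):
    """Compute the next grid layer, one vertex longer than `prev`."""
    k = len(prev)
    layer = []
    for j in range(k + 1):
        # Parents are positions j - 1 and j of the previous layer, if present.
        received = [noisy(prev[i], delta, rng) for i in (j - 1, j) if 0 <= i < k]
        # Assumption: a boundary vertex copies its single received bit.
        layer.append(received[0] if len(received) == 1 else gate(*received))
    return layer


def run_grid(source_bit, depth, delta, gate, seed=0):
    """Return the bits at layer `depth` (depth + 1 vertices)."""
    rng = random.Random(seed)
    layer = [source_bit]
    for _ in range(depth):
        layer = next_layer(layer, delta, gate, rng)
    return layer


def xor_gate(a, b):
    return a ^ b


def and_gate(a, b):
    return a & b
```

With `delta = 0.0` the process is deterministic, which makes it a convenient sanity check before experimenting with noisy channels and different gates.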
Bounds between contraction coefficients
In this paper, we delineate how the contraction coefficient of the strong data processing inequality for KL divergence can be used to learn likelihood models. We then present an alternative formulation that forces the input KL divergence to vanish, and achieves a contraction coefficient equivalent to the squared maximal correlation using a linear algebraic solution. To analyze the performance loss in using this simple but suboptimal procedure, we bound these coefficients in the discrete and finite regime, and prove their equivalence in the Gaussian regime.
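The squared maximal correlation mentioned in this abstract can be worked out in the simplest case, a pair of binary random variables. Every function of a binary variable is affine in its indicator, so the maximal correlation reduces to the absolute Pearson correlation of the two indicators. The sketch relies on that reduction and is restricted to 2×2 joint distributions. It is not the general linear algebraic procedure the abstract refers to, and the function names are hypothetical.

```python
import math


def binary_maximal_correlation(joint):
    """Maximal correlation of binary X, Y with joint[x][y] = P(X = x, Y = y).

    For binary variables this equals |Pearson correlation| of the indicators.
    """
    px1 = joint[1][0] + joint[1][1]  # P(X = 1)
    py1 = joint[0][1] + joint[1][1]  # P(Y = 1)
    covariance = joint[1][1] - px1 * py1
    var_x = px1 * (1.0 - px1)
    var_y = py1 * (1.0 - py1)
    if var_x == 0.0 or var_y == 0.0:
        return 0.0  # a constant variable is uncorrelated with everything
    return abs(covariance) / math.sqrt(var_x * var_y)


def doubly_symmetric_binary_source(delta):
    """Uniform bit X observed through a binary symmetric channel with flip probability delta."""
    return [
        [(1.0 - delta) / 2.0, delta / 2.0],
        [delta / 2.0, (1.0 - delta) / 2.0],
    ]
```

For the doubly symmetric binary source the result is |1 - 2·delta|, so the squared maximal correlation is (1 - 2·delta)².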
A study of local approximations in information theory
Thesis: S.M., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2015. This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections. Cataloged from student-submitted PDF version of thesis. Includes bibliographical references (pages 171-173).
The intractability of many information theoretic problems arises from the meaningful but nonlinear definition of Kullback-Leibler (KL) divergence between two probability distributions. Local information theory addresses this issue by assuming all distributions of interest are perturbations of certain reference distributions, and then approximating KL divergence with a squared weighted Euclidean distance, thereby linearizing such problems. We show that large classes of statistical divergence measures, such as f-divergences and Bregman divergences, can be approximated in an analogous manner to local metrics which are very similar in form. We then capture the cost of making local approximations of KL divergence instead of using its global value. This is achieved by appropriately bounding the tightness of the Data Processing Inequality in the local and global scenarios. This task turns out to be equivalent to bounding the chordal slope of the hypercontractivity ribbon at infinity and the Hirschfeld-Gebelein-Renyi maximal correlation with each other. We derive such bounds for the discrete and finite, as well as the Gaussian regimes. An application of the local approximation technique is in understanding the large deviation behavior of sources and channels. We elucidate a source-channel decomposition of the large deviation characteristics of i.i.d. sources going through discrete memoryless channels. This is used to derive an additive Gaussian noise channel model for the local perturbations of probability distributions.
We next shift our focus to infinite alphabet channels instead of discrete and finite channels. On this front, existing literature has demonstrated that the singular vectors of additive white Gaussian noise channels are Hermite polynomials, and the singular vectors of Poisson channels are Laguerre polynomials. We characterize the set of infinite alphabet channels whose singular value decompositions produce singular vectors that are orthogonal polynomials by providing equivalent conditions on the conditional moments. In doing so, we also unveil the elegant relationship between certain natural exponential families with quadratic variance functions, their conjugate priors, and their corresponding orthogonal polynomial singular vectors. Finally, we propose various related directions for future research in the hope that our work will beget more research concerning local approximation methods in information theory.
by Anuran Makur. S.M
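The local approximation at the heart of this thesis abstract can be checked numerically with a small sketch. For a perturbation Q = P + εJ of a reference distribution P, where the entries of J sum to zero, the second-order Taylor expansion gives D(Q‖P) ≈ ½ Σ (Q(x) - P(x))² / P(x), a squared weighted Euclidean distance. The distributions, the perturbation direction, and the function names below are illustrative assumptions and are not taken from the thesis.

```python
import math


def kl_divergence(q, p):
    """D(q || p) in nats for finite distributions with p > 0 entrywise."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0.0)


def chi_squared(q, p):
    """Chi-squared divergence: sum over x of (q(x) - p(x))^2 / p(x)."""
    return sum((qi - pi) ** 2 / pi for qi, pi in zip(q, p))


def perturb(p, direction, eps):
    """Return p + eps * direction; `direction` must sum to zero."""
    return [pi + eps * ji for pi, ji in zip(p, direction)]


def local_ratio(p, direction, eps):
    """Ratio of the exact KL divergence to its local (half chi-squared) approximation."""
    q = perturb(p, direction, eps)
    return kl_divergence(q, p) / (0.5 * chi_squared(q, p))
```

With `p = [0.5, 0.3, 0.2]` and `direction = [1.0, -1.0, 0.0]`, the ratio returned by `local_ratio` moves toward 1 as `eps` shrinks, which is the sense in which the approximation is local.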
Information contraction and decomposition
This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections. Thesis: Sc. D., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2019. Cataloged from student-submitted PDF version of thesis. Includes bibliographical references (pages 327-350).
Information contraction is one of the most fundamental concepts in information theory as evidenced by the numerous classical converse theorems that utilize it. In this dissertation, we study several problems aimed at better understanding this notion, broadly construed, within the intertwined realms of information theory, statistics, and discrete probability theory. In information theory, the contraction of f-divergences, such as Kullback-Leibler (KL) divergence, χ²-divergence, and total variation (TV) distance, through channels (or the contraction of mutual f-information along Markov chains) is quantitatively captured by the well-known data processing inequalities. These inequalities can be tightened to produce "strong" data processing inequalities (SDPIs), which are obtained by introducing appropriate channel-dependent or source-channel-dependent "contraction coefficients." We first prove various properties of contraction coefficients of source-channel pairs, and derive linear bounds on specific classes of such contraction coefficients in terms of the contraction coefficient for χ²-divergence (or the Hirschfeld-Gebelein-Rényi maximal correlation). Then, we extend the notion of an SDPI for KL divergence by analyzing when a q-ary symmetric channel dominates a given channel in the "less noisy" sense.
Specifically, we develop sufficient conditions for less noisy domination using ideas of degradation and majorization, and strengthen these conditions for additive noise channels over finite Abelian groups. Furthermore, we also establish equivalent characterizations of the less noisy preorder over channels using non-linear operator convex f-divergences, and illustrate the relationship between less noisy domination and important functional inequalities such as logarithmic Sobolev inequalities. Next, adopting a more statistical and machine learning perspective, we elucidate the elegant geometry of SDPIs for χ²-divergence by developing modal decompositions of bivariate distributions based on singular value decompositions of conditional expectation operators. In particular, we demonstrate that maximal correlation functions meaningfully decompose the information contained in categorical bivariate data in a local information geometric sense and serve as suitable embeddings of this data into Euclidean spaces. Moreover, we propose an extension of the well-known alternating conditional expectations algorithm to estimate maximal correlation functions from training data for the purposes of feature extraction and dimensionality reduction. We then analyze the sample complexity of this algorithm using basic matrix perturbation theory and standard concentration of measure inequalities. On a related but tangential front, we also define and study the information capacity of permutation channels.
Finally, we consider the discrete probability problem of broadcasting on bounded indegree directed acyclic graphs (DAGs), which corresponds to examining the contraction of TV distance in Bayesian networks whose vertices combine their noisy input signals using Boolean processing functions. This generalizes the classical problem of broadcasting on trees and Ising models, and is closely related to results on reliable computation using noisy circuits, probabilistic cellular automata, and information flow in biological networks. Specifically, we establish phase transition phenomena for random DAGs which imply (via the probabilistic method) the existence of DAGs with logarithmic layer size where broadcasting is possible. We also construct deterministic DAGs where broadcasting is possible using expander graphs in deterministic quasi-polynomial or randomized polylogarithmic time in the depth. Lastly, we show that broadcasting is impossible for certain two-dimensional regular grids using techniques from percolation theory and coding theory.
by Anuran Makur. Sc. D. Sc.D. Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Scienc