    Local Decode and Update for Big Data Compression

    This paper investigates data compression that simultaneously allows local decoding and local update. The main result is a universal compression scheme for memoryless sources with the following features: the rate can be made arbitrarily close to the entropy of the underlying source; contiguous fragments of the source can be recovered or updated by probing or modifying a number of codeword bits that is, on average, linear in the size of the fragment; and the overall encoding and decoding complexity is quasilinear in the blocklength of the source. In particular, local decoding or update of a single message symbol can be performed by probing or modifying, on average, a constant number of codeword bits. The latter improves on the best previously known results, in which local decodability or update efficiency grows logarithmically with the blocklength.
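    To make the probe/modify accounting concrete, here is a minimal Python sketch of the local decode/update interface. It is not the paper's scheme (it does not compress at all, whereas the paper achieves rates close to the entropy); the fixed block size and the helper names are assumptions of this toy. Reading or updating a fragment touches only the blocks that cover it.

```python
# Toy illustration of local decode / local update, NOT the paper's scheme:
# the source is split into fixed-size blocks, so reading or updating a
# fragment touches only the covering blocks.

BLOCK = 8  # bits per block (assumption of the toy)

def encode(bits):
    """Split the raw bit string into fixed-size blocks."""
    return [bits[i:i + BLOCK] for i in range(0, len(bits), BLOCK)]

def local_decode(blocks, start, length):
    """Recover bits[start:start+length] by probing only the covering blocks."""
    first, last = start // BLOCK, (start + length - 1) // BLOCK
    window = "".join(blocks[first:last + 1])   # probed bits
    offset = start - first * BLOCK
    return window[offset:offset + length]

def local_update(blocks, pos, bit):
    """Update a single raw bit by rewriting one block."""
    b, off = divmod(pos, BLOCK)
    blocks[b] = blocks[b][:off] + bit + blocks[b][off + 1:]

raw = "0110100111010001"
blocks = encode(raw)
assert local_decode(blocks, 5, 6) == raw[5:11]
local_update(blocks, 3, "1")
assert local_decode(blocks, 3, 1) == "1"
```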

    O(log log n) Worst-Case Local Decoding and Update Efficiency for Data Compression

    This paper addresses the problem of data compression with local decoding and local update. A compression scheme has worst-case local decoding d_wc if any bit of the raw file can be recovered by probing at most d_wc bits of the compressed sequence, and has update efficiency u_wc if a single bit of the raw file can be updated by modifying at most u_wc bits of the compressed sequence. This article provides an entropy-achieving compression scheme for memoryless sources that simultaneously achieves O(log log n) local decoding and update efficiency. Key to this achievability result is a novel succinct data structure for sparse sequences that allows efficient local decoding and local update. Under general assumptions on the local decoder and update algorithms, a converse result shows that the maximum of d_wc and u_wc must grow as Ω(log log n).
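    For contrast with the paper's succinct structure, the following naive Python baseline for a sparse binary sequence (class and method names are this sketch's assumptions) stores only the sorted positions of the 1s: a local decode costs a binary search, roughly log m probes for m ones, and a single-bit update may shift many stored entries. That gap is what the paper's O(log log n) structure is designed to close.

```python
# Naive baseline for a sparse binary sequence, NOT the paper's succinct
# structure: keep the sorted positions of the 1s.

import bisect

class SparseSeq:
    def __init__(self, n, ones):
        self.n = n                   # length of the raw sequence
        self.ones = sorted(ones)     # positions holding a 1

    def decode_bit(self, i):
        """Return bit i of the raw sequence (binary search: O(log m) probes)."""
        j = bisect.bisect_left(self.ones, i)
        return 1 if j < len(self.ones) and self.ones[j] == i else 0

    def update_bit(self, i, bit):
        """Set bit i of the raw sequence to `bit`."""
        j = bisect.bisect_left(self.ones, i)
        present = j < len(self.ones) and self.ones[j] == i
        if bit and not present:
            self.ones.insert(j, i)   # may move Θ(m) entries
        elif not bit and present:
            self.ones.pop(j)

s = SparseSeq(n=1000, ones=[3, 17, 999])
assert s.decode_bit(17) == 1 and s.decode_bit(18) == 0
s.update_bit(18, 1)
assert s.decode_bit(18) == 1
```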

    Capacity-achieving Polar-based LDGM Codes

    In this paper, we study codes with sparse generator matrices. More specifically, low-density generator-matrix (LDGM) codes with a constraint on the weight of every column of the generator matrix are considered. It is first shown that, given a binary-input memoryless symmetric (BMS) channel W and a constant s > 0, there exists a polarization kernel such that the corresponding polar code is capacity-achieving and the column weights of the generator matrices are bounded from above by N^s. Then, a general construction based on a concatenation of polar codes with a rate-1 code, together with a new column-splitting algorithm that guarantees a much sparser generator matrix, is given. More specifically, for any BMS channel and any ϵ > 2ϵ*, where ϵ* ≈ 0.085, the existence of a sequence of capacity-achieving codes with all column weights of the generator matrix upper bounded by (log N)^{1+ϵ} is shown. Furthermore, coding schemes for the BEC and for BMS channels, based on a second column-splitting algorithm, are devised with low-complexity decoding that uses successive cancellation. The second splitting algorithm allows the use of a low-complexity decoder by preserving the reliability of the bit-channels observed by the source bits and by increasing the code blocklength. In particular, for any BEC and any λ > λ* = 0.5 + ϵ*, the existence of a sequence of capacity-achieving codes where all column weights of the generator matrix are bounded from above by (log N)^{2λ}, with decoding complexity O(N log log N), is shown. The existence of similar capacity-achieving LDGM codes with low-complexity decoding is shown for any BMS channel and for any λ > λ† ≈ 0.631.
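    The sketch below is a hedged toy of the column-splitting idea, showing only the sparsification step: a heavy generator-matrix column is replaced by several columns whose supports partition the original support, so the original code bit becomes the XOR of the new bits and every column weight stays at or below a threshold t. The threshold and the plain chunking rule are assumptions of this toy; the paper's algorithms additionally choose the split so as to preserve the reliability of the bit-channels seen by the source bits, which is ignored here.

```python
# Toy column-splitting for a binary generator matrix: a column of weight
# w > t is replaced by ceil(w / t) columns whose supports partition the
# original support, so every column of the new matrix has weight <= t.

import numpy as np

def split_columns(G, t):
    """Return a generator matrix whose columns all have weight <= t."""
    cols = []
    for col in G.T:
        support = np.flatnonzero(col)
        if len(support) <= t:
            cols.append(col)
            continue
        for chunk in np.array_split(support, -(-len(support) // t)):
            new_col = np.zeros_like(col)
            new_col[chunk] = 1
            cols.append(new_col)
    return np.array(cols).T

G = np.array([[1, 1, 0],
              [1, 0, 1],
              [1, 1, 1],
              [1, 0, 0]])            # first column has weight 4
Gs = split_columns(G, t=2)
assert Gs.sum(axis=0).max() <= 2     # all column weights now bounded by t
```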

    A PAC-Theory of Clustering with Advice

    In the absence of domain knowledge, clustering is usually an under-specified task. For any clustering application, one can choose among a variety of clustering algorithms, along with different preprocessing techniques, that are likely to produce dramatically different answers. Any of these solutions may be acceptable depending on the application, and it is therefore critical to incorporate prior knowledge about the data and the intended semantics of clustering into the process of clustering model selection. One scenario that we study is when the user (i.e., the domain expert) provides a clustering of a (relatively small) random subset of the data set. The clustering algorithm then uses this kind of ``advice'' to come up with a data representation under which an application of a fixed clustering algorithm (e.g., k-means) results in a partition of the full data set that is aligned with the user's knowledge. We provide the ``advice complexity'' of learning a representation in this paradigm.

    Another form of ``advice'' can be obtained by allowing the clustering algorithm to interact with a domain expert through same-cluster queries: ``Do these two instances belong to the same cluster?''. The goal of the clustering algorithm is then to find a partition of the data set that is consistent with the domain expert's knowledge while using only a small number of queries. Aside from studying the ``advice complexity'' (i.e., query complexity) of learning in this model, we investigate the trade-offs between the computational and advice complexities of learning, showing that a little bit of advice can turn an otherwise computationally hard clustering problem into a tractable one.

    In the second part of this dissertation we study the problem of learning mixture models, where we are given an i.i.d. sample generated from an unknown target in a family of mixture distributions and want to output a distribution that is close to the target in total variation distance. In particular, given a sample-efficient learner for a base class of distributions (e.g., Gaussians), we show how to obtain a sample-efficient method for learning mixtures of the base class (e.g., mixtures of k Gaussians). As a byproduct of this analysis, we prove tighter sample complexity bounds for learning various mixture models. We also investigate how access to same-cluster queries (i.e., whether two instances were generated from the same mixture component) can help reduce the computational burden of learning within this model.

    Finally, we take a further step and introduce a novel method for distribution learning via a form of compression. In particular, we ask whether one can compress a large-enough sample set generated from a target distribution (by picking only a few instances from it) in a way that allows recovery of (an approximation to) the target distribution. We prove that if this is the case for all members of a class of distributions, then there is a sample-efficient way of learning distributions from this class. As an application of this notion, we settle (within logarithmic factors) the sample complexity of learning mixtures of k axis-aligned Gaussian distributions.
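    As a minimal illustration of the same-cluster-query interface mentioned in this abstract (not the dissertation's algorithms or their query-complexity guarantees), the Python sketch below greedily assigns each item by querying a simulated expert against one representative per cluster found so far; the oracle and the ground-truth labels are hypothetical and exist only to drive the example.

```python
# Minimal illustration of the same-cluster-query interface: greedily assign
# each item by querying the oracle against one representative per cluster
# found so far.  With k true clusters this uses at most k queries per item;
# the dissertation studies how few queries are actually necessary.

def cluster_with_queries(items, same_cluster):
    """`same_cluster(a, b)` plays the role of the domain expert's oracle."""
    clusters = []                      # each cluster is a list of items
    for x in items:
        for c in clusters:
            if same_cluster(x, c[0]):  # query against a representative
                c.append(x)
                break
        else:
            clusters.append([x])       # open a new cluster
    return clusters

# Hypothetical ground truth used only to simulate the expert's answers.
truth = {"a1": 0, "a2": 0, "b1": 1, "b2": 1, "c1": 2}
oracle = lambda x, y: truth[x] == truth[y]
print(cluster_with_queries(list(truth), oracle))
# [['a1', 'a2'], ['b1', 'b2'], ['c1']]
```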