    Generalized residual vector quantization for large scale data

    Vector quantization is an essential tool for tasks involving large scale data, for example, large scale similarity search, which is crucial for content-based information retrieval and analysis. In this paper, we propose a novel vector quantization framework that iteratively minimizes quantization error. First, we provide a detailed review of a relevant vector quantization method named \textit{residual vector quantization} (RVQ). Next, we propose \textit{generalized residual vector quantization} (GRVQ) to further improve over RVQ. Many vector quantization methods can be viewed as special cases of our proposed framework. We evaluate GRVQ on several large scale benchmark datasets for large scale search, classification and object retrieval, and compare it with existing methods in detail. Extensive experiments demonstrate that our GRVQ framework substantially outperforms existing methods in terms of quantization accuracy and computation efficiency. Comment: published at the International Conference on Multimedia and Expo 201
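
As a rough illustration of the plain RVQ idea reviewed in the abstract (not the proposed GRVQ), the sketch below encodes a vector stage by stage: each stage picks the codeword nearest to the current residual and subtracts it. The codebooks are hand-picked, hypothetical values; in practice they would be learned from data, and the function names `rvq_encode` and `rvq_decode` are assumptions made for this example.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]
Codebook = Sequence[Vector]


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def rvq_encode(x: Vector, codebooks: Sequence[Codebook]) -> Tuple[List[int], List[float]]:
    """Return one codeword index per stage plus the final residual."""
    residual = list(x)
    codes: List[int] = []
    for book in codebooks:
        # Pick the codeword closest to what is still unexplained.
        idx = min(range(len(book)), key=lambda i: sq_dist(residual, book[i]))
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, book[idx])]
    return codes, residual


def rvq_decode(codes: Sequence[int], codebooks: Sequence[Codebook]) -> List[float]:
    """Reconstruct the vector as the sum of the selected codewords."""
    out = [0.0] * len(codebooks[0][0])
    for idx, book in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, book[idx])]
    return out


# Hypothetical two-stage codebooks: a coarse stage followed by a fine stage.
CODEBOOKS = [
    [[0.0, 0.0], [4.0, 4.0], [4.0, -4.0], [-4.0, 4.0]],
    [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]],
]

codes, residual = rvq_encode([5.0, 4.0], CODEBOOKS)
reconstruction = rvq_decode(codes, CODEBOOKS)
```

Only the list of small integer indices needs to be stored per vector, which is where the compression comes from; the final residual measures the remaining quantization error.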

    Privacy-Preserving Shortest Path Computation

    Navigation is one of the most popular cloud computing services. But in virtually all cloud-based navigation systems, the client must reveal her location and destination to the cloud service provider in order to learn the fastest route. In this work, we present a cryptographic protocol for navigation on city streets that provides privacy for both the client's location and the service provider's routing data. Our key ingredient is a novel method for compressing the next-hop routing matrices in networks such as city street maps. Applying our compression method to the map of Los Angeles, for example, we achieve over tenfold reduction in the representation size. In conjunction with other cryptographic techniques, this compressed representation results in an efficient protocol suitable for fully-private real-time navigation on city streets. We demonstrate the practicality of our protocol by benchmarking it on real street map data for major cities such as San Francisco and Washington, D.C. Comment: Extended version of NDSS 2016 paper
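
To make the term "next-hop routing matrix" concrete, here is a minimal sketch that builds the uncompressed matrix for a tiny weighted graph with the Floyd-Warshall algorithm and then follows it to recover a route. It does not show the paper's compression method or any cryptographic step; the graph, the weights, and the function names are hypothetical.

```python
import math
from typing import List, Optional, Sequence, Tuple


def next_hop_matrix(n: int, edges: Sequence[Tuple[int, int, float]]) -> List[List[Optional[int]]]:
    """nxt[s][t] is the first node after s on a shortest s->t path, or None if t is unreachable."""
    dist = [[math.inf] * n for _ in range(n)]
    nxt: List[List[Optional[int]]] = [[None] * n for _ in range(n)]
    for v in range(n):
        dist[v][v] = 0.0
        nxt[v][v] = v
    for u, v, w in edges:
        if w < dist[u][v]:  # keep the cheapest of any parallel edges
            dist[u][v] = w
            nxt[u][v] = v
    # Floyd-Warshall: allow paths through intermediate node k.
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
                    nxt[i][j] = nxt[i][k]
    return nxt


def route(nxt: List[List[Optional[int]]], s: int, t: int) -> List[int]:
    """Follow next hops from s to t; an empty list means no route exists."""
    if nxt[s][t] is None:
        return []
    path = [s]
    while s != t:
        s = nxt[s][t]
        path.append(s)
    return path


# Hypothetical street graph: (from, to, travel time), listed in both directions.
ROADS = [(0, 1, 1.0), (1, 2, 1.0), (0, 2, 5.0), (2, 3, 1.0)]
EDGES = ROADS + [(v, u, w) for u, v, w in ROADS]
NXT = next_hop_matrix(4, EDGES)
PATH = route(NXT, 0, 3)
```

The matrix has one entry per (source, destination) pair, so its size grows quadratically with the number of intersections, which is why a compact representation matters for a protocol that must process it privately.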

    Compressing Word Embeddings

    Recent methods for learning vector space representations of words have succeeded in capturing fine-grained semantic and syntactic regularities using vector arithmetic. However, these vector space representations (created through large-scale text analysis) are typically stored verbatim, since their internal structure is opaque. Using word-analogy tests to monitor the level of detail stored in compressed re-representations of the same vector space, the trade-offs between the reduction in memory usage and expressiveness are investigated. A simple scheme is outlined that can reduce the memory footprint of a state-of-the-art embedding by a factor of 10, with only minimal impact on performance. Then, using the same `bit budget', a binary (approximate) factorisation of the same space is also explored, with the aim of creating an equivalent representation with better interpretability. Comment: 10 pages, 0 figures, submitted to ICONIP-2016. Previous experimental results were submitted to ICLR-2016, but the paper has been significantly updated, since a new experimental set-up worked much better
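
The abstract does not spell out the compression scheme, so the following is only a generic illustration of the memory-versus-precision trade-off it discusses, not the paper's method: each component of an embedding vector is mapped to a small integer code by uniform scalar quantization. The bit width, the example vector, and the function names are assumptions.

```python
from typing import List, Sequence, Tuple


def quantize(vec: Sequence[float], bits: int) -> Tuple[List[int], float, float]:
    """Map each component to an integer in [0, 2**bits - 1]; also return offset and step."""
    lo, hi = min(vec), max(vec)
    levels = (1 << bits) - 1
    step = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((x - lo) / step) for x in vec]
    return codes, lo, step


def dequantize(codes: Sequence[int], lo: float, step: float) -> List[float]:
    """Approximate reconstruction of the original components."""
    return [lo + c * step for c in codes]


# Hypothetical 5-dimensional embedding stored with 3 bits per component.
embedding = [-1.0, -0.5, 0.0, 0.5, 1.0]
codes, lo, step = quantize(embedding, bits=3)
approx = dequantize(codes, lo, step)
max_error = max(abs(a - b) for a, b in zip(embedding, approx))
```

Each reconstructed component differs from the original by at most half a quantization step, so fewer bits mean a larger step and a coarser vector space; word-analogy tests of the kind the abstract mentions are one way to measure how much that coarseness costs.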

    Compressing Sparse Sequences under Local Decodability Constraints

    We consider a variable-length source coding problem subject to local decodability constraints. In particular, we investigate the blocklength scaling behavior attainable by encodings of $r$-sparse binary sequences, under the constraint that any source bit can be correctly decoded upon probing at most $d$ codeword bits. We consider both adaptive and non-adaptive access models, and derive upper and lower bounds that often coincide up to constant factors. Notably, such a characterization for the fixed-blocklength analog of our problem remains unknown, despite considerable research over the last three decades. Connections to communication complexity are also briefly discussed. Comment: 8 pages, 1 figure. First five pages to appear in 2015 International Symposium on Information Theory. This version contains supplementary material

    Connectivity Compression for Irregular Quadrilateral Meshes

    Applications that require Internet access to remote 3D datasets are often limited by the storage costs of 3D models. Several compression methods are available to address these limits for objects represented by triangle meshes. Many CAD and VRML models, however, are represented as quadrilateral meshes or mixed triangle/quadrilateral meshes, and these models may also require compression. We present an algorithm for encoding the connectivity of such quadrilateral meshes, and we demonstrate that by preserving and exploiting the original quad structure, our approach achieves encodings 30-80% smaller than an approach based on randomly splitting quads into triangles. We present both a code with a proven worst-case cost of 3 bits per vertex (or 2.75 bits per vertex for meshes without valence-two vertices) and entropy-coding results for typical meshes ranging from 0.3 to 0.9 bits per vertex, depending on the regularity of the mesh. Our method may be implemented by a rule for a particular splitting of quads into triangles and by using the compression and decompression algorithms introduced in [Rossignac99] and [Rossignac&Szymczak99]. We also present extensions to the algorithm to compress meshes with holes and handles, and meshes containing triangles and other polygons as well as quads.

    On the Hardness and Inapproximability of Recognizing Wheeler Graphs

    In recent years several compressed indexes based on variants of the Burrows-Wheeler transformation have been introduced. Some of these are used to index structures far more complex than a single string, as was originally done with the FM-index [Ferragina and Manzini, J. ACM 2005]. As such, there has been an increasing effort to better understand under which conditions such an indexing scheme is possible. This has led to the introduction of Wheeler graphs [Gagie et al., Theor. Comput. Sci., 2017]. Gagie et al. showed that de Bruijn graphs, generalized compressed suffix arrays, and several other BWT related structures can be represented as Wheeler graphs, and that Wheeler graphs can be indexed in a way which is space efficient. Hence, being able to recognize whether a given graph is a Wheeler graph, or being able to approximate a given graph by a Wheeler graph, could have numerous applications in indexing. Here we resolve the open question of whether there exists an efficient algorithm for recognizing if a given graph is a Wheeler graph. We present the following results:
    - The problem of recognizing whether a given graph G=(V,E) is a Wheeler graph is NP-complete for any edge label alphabet of size sigma >= 2, even when G is a DAG. This holds even on a restricted subset of graphs called d-NFA's for d >= 5. This is in contrast to recent results demonstrating the problem can be solved in polynomial time for d-NFA's where d <= 2. We also show the recognition problem can be solved in linear time for sigma = 1;
    - There exists a 2^{e log sigma + O(n + e)} time exact algorithm where n = |V| and e = |E|. This algorithm relies on graph isomorphism being computable in strictly sub-exponential time;
    - We define an optimization variant of the problem called Wheeler Graph Violation, abbreviated WGV, where the aim is to remove the minimum number of edges in order to obtain a Wheeler graph. We show WGV is APX-hard, even when G is a DAG, implying there exists a constant C >= 1 for which there is no C-approximation algorithm (unless P = NP). Also, conditioned on the Unique Games Conjecture, for all C >= 1, it is NP-hard to find a C-approximation;
    - We define the Wheeler Subgraph problem, abbreviated WS, where the aim is to find the largest subgraph which is a Wheeler graph (the dual of WGV). In contrast to WGV, we prove that the WS problem is in APX for sigma = O(1).
    The above findings suggest that most problems under this theme are computationally difficult. However, we identify a class of graphs for which the recognition problem is polynomial time solvable, raising the open question of which parameters determine this problem's difficulty.
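
For readers unfamiliar with the object being recognized, the sketch below checks the Wheeler conditions from Gagie et al. (2017) for a proposed node ordering: nodes with in-degree 0 come first, and for any two edges (u, v) labeled a and (u', v') labeled a', a < a' implies v < v', while a = a' and u < u' implies v <= v'. Verifying one ordering takes polynomial time, which is why the problem is in NP; the brute-force `is_wheeler` that tries every ordering is only an illustration for tiny graphs and is not the exact algorithm from the abstract. Function names and example graphs are assumptions.

```python
from itertools import permutations
from typing import Hashable, Sequence, Tuple

Edge = Tuple[Hashable, Hashable, str]  # (source, target, label)


def is_wheeler_order(order: Sequence[Hashable], edges: Sequence[Edge]) -> bool:
    """Check whether `order` satisfies the Wheeler graph conditions for `edges`."""
    pos = {node: i for i, node in enumerate(order)}
    has_incoming = {v for _, v, _ in edges}
    # Condition 0: every in-degree-0 node precedes every node with an incoming edge.
    seen_positive = False
    for node in order:
        if node in has_incoming:
            seen_positive = True
        elif seen_positive:
            return False
    for u, v, a in edges:
        for u2, v2, a2 in edges:
            # Condition (i): smaller label forces strictly smaller target.
            if a < a2 and not pos[v] < pos[v2]:
                return False
            # Condition (ii): equal labels preserve source order on targets.
            if a == a2 and pos[u] < pos[u2] and not pos[v] <= pos[v2]:
                return False
    return True


def is_wheeler(nodes: Sequence[Hashable], edges: Sequence[Edge]) -> bool:
    """Brute force over all orderings; factorial time, for tiny graphs only."""
    return any(is_wheeler_order(p, edges) for p in permutations(nodes))


# A labeled path is a Wheeler graph.
PATH_EDGES = [(0, 1, "a"), (1, 2, "b")]
# A node entered by edges with two different labels can never satisfy condition (i).
BAD_EDGES = [(0, 2, "a"), (1, 2, "b")]
```

The second example shows a quick necessary condition: in a Wheeler graph all edges entering a given node carry the same label.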

    Superselectors: Efficient Constructions and Applications

    We introduce a new combinatorial structure: the superselector. We show that superselectors subsume several important combinatorial structures used in the past few years to solve problems in group testing, compressed sensing, multi-channel conflict resolution and data security. We prove close upper and lower bounds on the size of superselectors and we provide efficient algorithms for their construction. Although our bounds are very general, when they are instantiated on the combinatorial structures that are particular cases of superselectors (e.g., (p,k,n)-selectors, (d,\ell)-list-disjunct matrices, MUT_k(r)-families, FUT(k, a)-families, etc.) they match the best known bounds in terms of the size of the structures (the relevant parameter in the applications). For appropriate values of the parameters, our results also provide the first efficient deterministic algorithms for the construction of such structures.

    Space Efficient Encodings for Bit-strings, Range queries and Related Problems

    Thesis (Ph.D.) -- Seoul National University Graduate School: Department of Electrical and Computer Engineering, February 2016. Srinivasa Rao Satti. In this thesis, we design and implement various space efficient data structures. Most of these structures use space close to the information-theoretic lower bound while supporting the queries efficiently. In particular, this thesis is concerned with the data structures for four problems: (i) supporting \rank{} and \select{} queries on compressed bit strings, (ii) nearest larger neighbor problem, (iii) simultaneous encodings for range and next/previous larger/smaller value queries, and (iv) range \topk{} queries on two-dimensional arrays. We first consider practical implementations of \emph{compressed} bitvectors, which support \rank{} and \select{} operations on a given bit-string, while storing the bit-string in compressed form~\cite{DBLP:conf/dcc/JoJORS14}. Our approach relies on \emph{variable-to-fixed} encodings of the bit-string, an approach that has not yet been considered systematically for practical encodings of bitvectors. We show that this approach leads to fast practical implementations with low \emph{redundancy} (i.e., the space used by the bitvector in addition to the compressed representation of the bit-string), and is a flexible and promising solution to the problem of supporting \rank{} and \select{} on moderately compressible bit-strings, such as those encountered in real-world applications. Next, we propose space-efficient data structures for the nearest larger neighbor problem~\cite{IWOCA2014,walcom-JoRS15}. Given a sequence of $n$ elements from a total order, and a position in the sequence, the nearest larger neighbor (\NLV{}) query returns the position of the element which is closest to the query position, and is larger than the element at the query position. The problem of finding all nearest larger neighbors has attracted interest due to its applications for parenthesis matching and in computational geometry~\cite{AsanoBK09,AsanoK13,BerkmanSV93}. 
We consider a data structure version of this problem, which is to preprocess a given sequence of elements to construct a data structure that can answer \NLN{} queries efficiently. For one-dimensional arrays, we give time-space tradeoffs for the problem in the \textit{indexing model}. For two-dimensional arrays, we give an optimal encoding with constant query time in the \textit{encoding model}. We also propose space-efficient encodings which support various range queries, and previous and next smaller/larger value queries~\cite{cocoonJS15}. Given a sequence of $n$ elements from a total order, we obtain a $(4.088n + o(n))$-bit encoding that supports all these queries, where $n$ is the length of the input array. For the case when we need to support all these queries in constant time, we give an encoding that takes $4.585n + o(n)$ bits. This improves the $(5.08n + o(n))$-bit encoding obtained by encoding the colored 2d-Min and 2d-Max heaps proposed by Fischer~\cite{Fischer11}. We extend the original DFUDS~\cite{BDMRRR05} encoding of the colored 2d-Min and 2d-Max heap that supports the queries in constant time. Then, we combine the extended DFUDS of the 2d-Min heap and 2d-Max heap using the Min-Max encoding of Gawrychowski and Nicholson~\cite{Gawry14} with some modifications. We also obtain encodings that take less space and support a subset of these queries. Finally, we consider the various encodings that support range \topk{} queries on a two-dimensional array containing elements from a total order. For an $m \times n$ array, we first propose an optimal encoding for answering one-sided \topk{} queries, whose query range is restricted to $[1 \dots m][1 \dots a]$, for $1 \le a \le n$. Next, we propose an encoding for the general \topk{} queries that takes $m^2\lg{\binom{(k+1)n}{n}} + m\lg{m} + o(n)$ bits. 
This generalizes the \topk{} encoding of Gawrychowski and Nicholson~\cite{Gawry14}.

    Table of contents:
    Chapter 1 Introduction
    1.1 Computational model
    1.1.1 Encoding and indexing models
    1.2 Contribution of the thesis
    1.3 Organization of the thesis
    Chapter 2 Preliminaries
    Chapter 3 Compressed bit vectors based on variable-to-fixed encodings
    3.1 Introduction
    3.2 Bit-vectors using V2F coding
    3.3 V2F compression algorithms for bit-strings
    3.3.1 Tunstall code
    3.3.2 Enumerative codes
    3.3.3 LZW algorithm
    3.3.4 Empirical evaluation of the compressors
    3.4 Practical implementation of bitvectors based on V2F compression
    3.4.1 Testing Methodology
    3.4.2 Results of Empirical Evaluation
    3.5 Future works
    Chapter 4 Space Efficient Data Structures for Nearest Larger Neighbor
    4.1 Introduction
    4.2 Indexing NLV queries on 1D arrays
    4.3 Encoding NLN queries on 2D binary arrays
    4.4 Encoding NLN queries for general 2D arrays
    4.4.1 2D NLN in the encoding model–distinct case
    4.4.2 2D NLN in the encoding model–general case
    4.5 Open problems
    Chapter 5 Simultaneous encodings for range and next/previous larger/smaller value queries
    5.1 Introduction
    5.2 Preliminaries
    5.2.1 2d-Min heap
    5.2.2 Encoding range min-max queries
    5.3 Extended DFUDS for colored 2d-Min heap
    5.4 Encoding colored 2d-Min and 2d-Max heaps
    5.4.1 Combined data structure for DCMin(A) and DCMax(A)
    5.4.2 Encoding colored 2d-Min and 2d-Max heaps using less space
    5.5 Open problems
    Chapter 6 Encoding Two-dimensional range Top-k queries
    6.1 Introduction
    6.2 Encoding one-sided range Top-k queries on 2D array
    6.3 Encoding general range Top-k queries on 2D array
    6.4 Open problems
    Chapter 7 Conclusion
    Bibliography
    Abstract (in Korean)
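
The nearest larger neighbor query defined in the thesis abstract can be illustrated with a plain, non-succinct baseline. The sketch below answers the query for every position of a one-dimensional array in linear time using two stack passes; it is not one of the space-efficient structures from the thesis. The tie-breaking rule (prefer the left neighbor when both sides are equally close) and the function name are assumptions.

```python
from typing import List, Optional, Sequence


def nearest_larger_neighbors(a: Sequence[int]) -> List[Optional[int]]:
    """For each i, return the closest position j with a[j] > a[i], or None if there is none."""
    n = len(a)
    left: List[Optional[int]] = [None] * n
    right: List[Optional[int]] = [None] * n

    # Nearest strictly larger element to the left of each position.
    stack: List[int] = []
    for i in range(n):
        while stack and a[stack[-1]] <= a[i]:
            stack.pop()
        left[i] = stack[-1] if stack else None
        stack.append(i)

    # Nearest strictly larger element to the right of each position.
    stack = []
    for i in range(n - 1, -1, -1):
        while stack and a[stack[-1]] <= a[i]:
            stack.pop()
        right[i] = stack[-1] if stack else None
        stack.append(i)

    result: List[Optional[int]] = []
    for i in range(n):
        l, r = left[i], right[i]
        if l is None:
            result.append(r)
        elif r is None:
            result.append(l)
        else:
            # Assumed tie-break: prefer the left neighbor at equal distance.
            result.append(l if i - l <= r - i else r)
    return result


ANSWERS = nearest_larger_neighbors([3, 1, 4, 1, 5, 9, 2, 6])
```

This baseline stores one full array index per position; the encodings studied in the thesis aim to answer the same queries using a number of bits close to the information-theoretic minimum.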