1,500 research outputs found

Optimal Hashing-based Time-Space Trade-offs for Approximate Near Neighbors

Author: Andoni Alexandr
Klein Philip N.
Laarhoven Thijs
Razenshteyn Ilya
Waingarten Erik
Publication venue: 'Society for Industrial & Applied Mathematics (SIAM)'
Publication date: 01/01/2016

[See the paper for the full abstract.] We show tight upper and lower bounds for time-space trade-offs for the $c$-Approximate Near Neighbor Search problem. For the $d$-dimensional Euclidean space and $n$-point datasets, we develop a data structure with space $n^{1 + \rho_u + o(1)} + O(dn)$ and query time $n^{\rho_q + o(1)} + d n^{o(1)}$ for every $\rho_u, \rho_q \geq 0$ such that: \begin{equation} c^2 \sqrt{\rho_q} + (c^2 - 1) \sqrt{\rho_u} = \sqrt{2c^2 - 1}. \end{equation} This is the first data structure that achieves sublinear query time and near-linear space for every approximation factor $c > 1$, improving upon [Kapralov, PODS 2015]. The data structure is a culmination of a long line of work on the problem for all space regimes; it builds on Spherical Locality-Sensitive Filtering [Becker, Ducas, Gama, Laarhoven, SODA 2016] and data-dependent hashing [Andoni, Indyk, Nguyen, Razenshteyn, SODA 2014] [Andoni, Razenshteyn, STOC 2015]. Our matching lower bounds are of two types: conditional and unconditional. First, we prove tightness of the whole above trade-off in a restricted model of computation, which captures all known hashing-based approaches. We then show unconditional cell-probe lower bounds for one and two probes that match the above trade-off for $\rho_q = 0$, improving upon the best known lower bounds from [Panigrahy, Talwar, Wieder, FOCS 2010]. In particular, this is the first space lower bound (for any static data structure) for two probes which is not polynomially smaller than the one-probe bound. To show the result for two probes, we establish and exploit a connection to locally-decodable codes. Comment: 62 pages, 5 figures; a merger of arXiv:1511.07527 [cs.DS] and arXiv:1605.02701 [cs.DS], which subsumes both of the preprints. New version contains more elaborated proofs and fixes some typos

arXiv.org e-Print Archive

Repository TU/e

Crossref

Pure OAI Repository

Robust parent-identifying codes and combinatorial arrays

Author: Alexander Barg
Grigory Kabatiansky
Publication venue: International Association for Cryptologic Research (IACR)
Publication date: 12/05/2011

An $n$-word $y$ over a finite alphabet of cardinality $q$ is called a descendant of a set of $t$ words $x^1,\dots,x^t$ if $y_i\in\{x^1_i,\dots,x^t_i\}$ for all $i=1,\dots,n$. A code $\mathcal{C}=\{x^1,\dots,x^M\}$ is said to have the $t$-IPP property if for any $n$-word $y$ that is a descendant of at most $t$ parents belonging to the code it is possible to identify at least one of them. From earlier works it is known that $t$-IPP codes of positive rate exist if and only if $t\le q-1$. We introduce a robust version of IPP codes which allows unconditional identification of parents even if some of the coordinates in $y$ can break away from the descent rule, i.e., can take arbitrary values from the alphabet, or become completely unreadable. We show existence of robust $t$-IPP codes for all $t\le q-1$ and some positive proportion of such coordinates. The proofs involve relations between IPP codes and combinatorial arrays with separating properties such as perfect hash functions and hash codes, partially hashing families and separating codes. For $t=2$ we find the exact proportion of mutant coordinates (for several error scenarios) that permits unconditional identification of parents
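The descendant relation defined in the abstract above can be checked directly. The sketch below is a brute-force illustration for tiny codes only. It assumes one common formalization of "identifying at least one parent": a parent is identifiable if it belongs to every coalition of at most t codewords that could have produced the word. The function names are hypothetical, and the robust (mutant-coordinate) variant is not modeled.

```python
from itertools import combinations


def is_descendant(y, parents):
    """True if every coordinate of y matches the same coordinate of some parent."""
    return all(any(y[i] == x[i] for x in parents) for i in range(len(y)))


def parent_coalitions(y, code, t):
    """All subsets of at most t codewords of which y is a descendant."""
    return [
        set(group)
        for size in range(1, t + 1)
        for group in combinations(code, size)
        if is_descendant(y, group)
    ]


def identifiable_parents(y, code, t):
    """Codewords common to every possible coalition (assumed identification rule)."""
    coalitions = parent_coalitions(y, code, t)
    if not coalitions:
        return set()  # y is not a descendant of any coalition of size <= t
    return set.intersection(*coalitions)


if __name__ == "__main__":
    code = ["000", "111", "222"]  # toy code over an alphabet of size q = 3
    print(identifiable_parents("001", code, t=2))
    print(identifiable_parents("000", code, t=2))
```

Enumerating all coalitions is exponential in t, so this is only meant to make the definition tangible on small examples.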

Cryptology ePrint Archive

Syntax tree fingerprinting: a foundation for source code similarity detection

Author: Chilowicz Michel
Duris Étienne
Roussel Gilles
Publication venue: HAL CCSD
Publication date: 01/01/2009

Plagiarism detection and clone refactoring in software depend on one common concern: finding similar source chunks across large repositories. However, since code duplication in software is often the result of copy-paste behaviors, only minor modifications are expected between shared codes. On the contrary, in a plagiarism detection context, edits are more extensive and exact matching strategies show their limits. Among the three main representations used by source code similarity detection tools, namely the linear token sequences, the Abstract Syntax Tree (AST) and the Program Dependency Graph (PDG), we believe that the AST could efficiently support the program analysis and transformations required for the advanced similarity detection process. In this paper we present a simple and scalable architecture based on syntax tree fingerprinting. Thanks to a study of several hashing strategies reducing false-positive collisions, we propose a framework that efficiently indexes AST representations in a database, that quickly detects exact (w.r.t. source code abstraction) clone clusters and that easily retrieves their corresponding ASTs. Our aim is to allow further processing of neighboring exact matches in order to identify the larger approximate matches, dealing with the common modification patterns seen in the intra-project copy-pastes and in the plagiarism cases
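As a rough illustration of the idea of fingerprinting syntax trees to find exact clones up to an abstraction, the sketch below hashes Python ASTs bottom-up using node types only, so identifier names and literal values are ignored. This is not the authors' framework: the choice of Python's `ast` module, SHA-1, the in-memory index, and all function names are assumptions made for the example.

```python
import ast
import hashlib
from collections import defaultdict


def fingerprint(node):
    """Hash a subtree from its node type and its children's hashes.

    Names and constants are not hashed, which is the abstraction assumed here.
    """
    child_hashes = [fingerprint(child) for child in ast.iter_child_nodes(node)]
    payload = type(node).__name__ + "(" + ",".join(child_hashes) + ")"
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()


def function_fingerprints(source):
    """Map each function name in the source to the fingerprint of its subtree."""
    tree = ast.parse(source)
    return {
        node.name: fingerprint(node)
        for node in ast.walk(tree)
        if isinstance(node, ast.FunctionDef)
    }


def clone_clusters(source):
    """Group functions whose fingerprints collide (a stand-in for a database index)."""
    index = defaultdict(list)
    for name, digest in function_fingerprints(source).items():
        index[digest].append(name)
    return [sorted(names) for names in index.values() if len(names) > 1]


SAMPLE = '''
def f(a, b):
    return a + b

def g(x, y):
    return x + y

def h(x, y):
    return x * y
'''

if __name__ == "__main__":
    print(clone_clusters(SAMPLE))
```

In this sample, `f` and `g` differ only in parameter names and so share a fingerprint, while `h` uses a different operator node and does not.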

HAL-Ecole des Ponts ParisTech

HAL - UPEC / UPEM

Dynamic Ordered Sets with Exponential Search Trees

Author: Andersson Arne
Thorup Mikkel
Publication venue
Publication date: 01/01/2002

We introduce exponential search trees as a novel technique for converting static polynomial space search structures for ordered sets into fully-dynamic linear space data structures. This leads to an optimal bound of O(sqrt(log n/loglog n)) for searching and updating a dynamic set of n integer keys in linear space. Here searching an integer y means finding the maximum key in the set which is smaller than or equal to y. This problem is equivalent to the standard textbook problem of maintaining an ordered set (see, e.g., Cormen, Leiserson, Rivest, and Stein: Introduction to Algorithms, 2nd ed., MIT Press, 2001). The best previous deterministic linear space bound was O(log n/loglog n) due to Fredman and Willard from STOC 1990. No better deterministic search bound was known using polynomial space. We also get the following worst-case linear space trade-offs between the number n, the word length w, and the maximal key U < 2^w: O(min{loglog n+log n/log w, (loglog n)(loglog U)/(logloglog U)}). These trade-offs are, however, not likely to be optimal. Our results are generalized to finger searching and string searching, providing optimal results for both in terms of n. Comment: Revision corrects some typos and states things better for applications in subsequent paper
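The search operation described in the abstract above (find the maximum key smaller than or equal to y) can be illustrated with a plain sorted list and binary search. This is only a baseline showing the query semantics in O(log n) comparisons on a static list; it is not an exponential search tree, and the function name is an assumption.

```python
from bisect import bisect_right


def predecessor(sorted_keys, y):
    """Return the largest key <= y in an ascending list, or None if there is none."""
    i = bisect_right(sorted_keys, y)  # index of the first key strictly greater than y
    return sorted_keys[i - 1] if i > 0 else None


if __name__ == "__main__":
    keys = [3, 8, 15, 42]
    for y in (10, 8, 2, 100):
        print(y, "->", predecessor(keys, y))
```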

arXiv.org e-Print Archive

CiteSeerX

Compartmentalized Connection Graphs for Concurrent Logic Programming II : Parallelism, Indexing and Unification

Author: Powers David M. W.
Publication venue: Saarländische Universitäts- und Landesbibliothek
Publication date: 01/01/1990

This report continues to document the development of a logic programming paradigm with implicit control, based in a compartmentalized connection graph theorem prover. Whilst the research has as its main goal the development of a language in which programs can be written with much less explicit control than PROLOG and its existing successors, a secondary goal is to exploit the immense parallelism inherent in the connection graph. The focus of this paper is the documentation of the extent of the parallelism inherent in the proof procedure. We characterize six different forms of parallelism. These various forms of parallelism can be further classified into two classes: those associated with the performance of resolution steps, and those which are more concerned with unification. Unification is thus also a major topic of this report. In the first report of this series unification was identified as a major source of the cost of executing a logic program, or of proving a theorem. It turns out that deferring unification is one of the best ways of dealing with it: hashing to perform it, and indexing to avoid it. Indexing and hashing are therefore the third topic covered in this report

Universaar

Acronym

Similarity learning for person re-identification and semantic video retrieval

Author: Chen Yuting
Publication venue
Publication date: 10/07/2017

Many computer vision problems boil down to the learning of a good visual similarity function that calculates a score of how likely two instances share the same semantic concept. In this thesis, we focus on two problems related to similarity learning: Person Re-Identification, and Semantic Video Retrieval. Person Re-Identification aims to maintain the identity of an individual in diverse locations through different non-overlapping camera views. Starting with two cameras, we propose a novel visual word co-occurrence based appearance model to measure the similarities between pedestrian images. This model naturally accounts for spatial similarities and variations caused by pose, illumination and configuration changes across camera views. As a generalization to multiple camera views, we introduce the Group Membership Prediction (GMP) problem. The GMP problem involves predicting whether a collection of instances shares the same semantic property. In this context, we propose a novel probability model and introduce latent view-specific and view-shared random variables to jointly account for the view-specific appearance and cross-view similarities among data instances. Our method is tested on various benchmarks demonstrating superior accuracy over state-of-art. Semantic Video Retrieval seeks to match complex activities in a surveillance video to user described queries. In surveillance scenarios with noise and clutter usually present, visual uncertainties introduced by error-prone low-level detectors, classifiers and trackers compose a significant part of the semantic gap between user defined queries and the archive video. To bridge the gap, we propose a novel probabilistic activity localization formulation that incorporates learning of object attributes, between-object relationships, and object re-identification without activity-level training data. 
Our experiments demonstrate that the introduction of similarity learning components effectively compensates for noise and error in previous stages, and results in better performance on both aerial and ground surveillance videos. Considering the computational complexity of our similarity learning models, we attempt to develop a way of training complicated models efficiently while retaining good performance. As a proof-of-concept, we propose training deep neural networks for supervised learning of hash codes. With slight changes in the optimization formulation, we could explore the possibilities of incorporating the training framework for Person Re-Identification and related problems.

Boston University Institutional Repository (OpenBU)