53 research outputs found

    Hardness of Bichromatic Closest Pair with Jaccard Similarity

    Consider collections $\mathcal{A}$ and $\mathcal{B}$ of red and blue sets, respectively. Bichromatic Closest Pair is the problem of finding a pair from $\mathcal{A}\times \mathcal{B}$ that has similarity higher than a given threshold according to some similarity measure. Our focus here is the classic Jaccard similarity $|\textbf{a}\cap \textbf{b}|/|\textbf{a}\cup \textbf{b}|$ for $(\textbf{a},\textbf{b})\in \mathcal{A}\times \mathcal{B}$. We consider the approximate version of the problem where we are given thresholds $j_1>j_2$ and wish to return a pair from $\mathcal{A}\times \mathcal{B}$ that has Jaccard similarity higher than $j_2$ if there exists a pair in $\mathcal{A}\times \mathcal{B}$ with Jaccard similarity at least $j_1$. The classic locality sensitive hashing (LSH) algorithm of Indyk and Motwani (STOC '98), instantiated with the MinHash LSH function of Broder et al., solves this problem in $\tilde O(n^{2-\delta})$ time if $j_1\ge j_2^{1-\delta}$. In particular, for $\delta=\Omega(1)$, the approximation ratio $j_1/j_2=1/j_2^{\delta}$ increases polynomially in $1/j_2$. In this paper we give a corresponding hardness result. Assuming the Orthogonal Vectors Conjecture (OVC), we show that there cannot be a general solution that solves the Bichromatic Closest Pair problem in $O(n^{2-\Omega(1)})$ time for $j_1/j_2=1/j_2^{o(1)}$. Specifically, assuming OVC, we prove that for any $\delta>0$ there exists an $\varepsilon>0$ such that Bichromatic Closest Pair with Jaccard similarity requires time $\Omega(n^{2-\delta})$ for any choice of thresholds $j_2<j_1<1-\delta$ that satisfy $j_1\le j_2^{1-\varepsilon}$.
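
    The definitions in this abstract can be made concrete with a short Python sketch. It is an illustration only, not code from the paper: the function names `jaccard`, `brute_force_bcp`, and `largest_delta` are hypothetical, the search is the quadratic-time baseline rather than the LSH algorithm, and `largest_delta` simply solves the stated condition $j_1\ge j_2^{1-\delta}$ for $\delta$, assuming $0<j_2<j_1<1$.

```python
import math
from itertools import product


def jaccard(a, b):
    """Jaccard similarity |a ∩ b| / |a ∪ b| of two sets (0.0 if both are empty)."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def brute_force_bcp(reds, blues, j2):
    """Quadratic-time baseline (not LSH): return some red/blue pair whose
    Jaccard similarity exceeds j2, or None if no such pair exists."""
    for a, b in product(reds, blues):
        if jaccard(a, b) > j2:
            return a, b
    return None


def largest_delta(j1, j2):
    """Largest delta satisfying j1 >= j2**(1 - delta), assuming 0 < j2 < j1 < 1.

    Taking logarithms (both are negative) gives delta <= 1 - ln(j1)/ln(j2).
    """
    return 1.0 - math.log(j1) / math.log(j2)
```

    For example, with thresholds $j_1=0.5$ and $j_2=0.25$ the condition holds for every $\delta\le 0.5$, which corresponds to the $\tilde O(n^{1.5})$ regime of the bound quoted above.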

    Pseudorandom Hashing for Space-bounded Computation with Applications in Streaming

    We revisit Nisan's classical pseudorandom generator (PRG) for space-bounded computation (STOC 1990) and its applications in streaming algorithms. We describe a new generator, HashPRG, that can be thought of as a symmetric version of Nisan's generator over larger alphabets. Our generator allows a trade-off between seed length and the time needed to compute a given block of the generator's output. HashPRG can be used to obtain derandomizations with much better update time and without sacrificing space for a large number of data stream algorithms, such as $F_p$ estimation in the parameter regimes $p > 2$ and $0 < p < 2$ and CountSketch with tight estimation guarantees as analyzed by Minton and Price (SODA 2014) which assumed access to a random oracle. We also show a recent analysis of Private CountSketch can be derandomized using our techniques. For a $d$-dimensional vector $x$ being updated in a turnstile stream, we show that $\|x\|_{\infty}$ can be estimated up to an additive error of $\varepsilon\|x\|_{2}$ using $O(\varepsilon^{-2}\log(1/\varepsilon)\log d)$ bits of space. Additionally, the update time of this algorithm is $O(\log 1/\varepsilon)$ in the Word RAM model. We show that the space complexity of this algorithm is optimal up to constant factors. However, for vectors $x$ with $\|x\|_{\infty} = \Theta(\|x\|_{2})$, we show that the lower bound can be broken by giving an algorithm that uses $O(\varepsilon^{-2}\log d)$ bits of space which approximates $\|x\|_{\infty}$ up to an additive error of $\varepsilon\|x\|_{2}$. We use our aforementioned derandomization of the CountSketch data structure to obtain this algorithm, and using the time-space trade off of HashPRG, we show that the update time of this algorithm is also $O(\log 1/\varepsilon)$ in the Word RAM model. Comment: Minor writing improvement
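
    For readers unfamiliar with CountSketch, the following is a minimal Python sketch of the basic data structure under turnstile updates. It is an assumption-laden illustration, not the derandomized construction from the paper: HashPRG is not implemented, the class and method names are hypothetical, and the per-row hashes are simple seeded linear hashes modulo a prime, used here as a stand-in for the hash families the formal analyses require.

```python
import random
import statistics


class CountSketch:
    """Basic CountSketch: `rows` independent hash tables of `width` counters."""

    def __init__(self, rows, width, seed=0):
        rng = random.Random(seed)
        self.width = width
        self.prime = (1 << 61) - 1  # Mersenne prime used as the hash modulus
        # Per row: (a, b) drive the bucket hash, (c, d) drive the sign hash.
        # Assumption: linear hashes stand in for stronger hash families.
        self.params = [
            tuple(rng.randrange(1, self.prime) for _ in range(4))
            for _ in range(rows)
        ]
        self.table = [[0] * width for _ in range(rows)]

    def _bucket_sign(self, row, i):
        a, b, c, d = self.params[row]
        bucket = ((a * i + b) % self.prime) % self.width
        sign = 1 if ((c * i + d) % self.prime) & 1 else -1
        return bucket, sign

    def update(self, i, delta):
        """Turnstile update x[i] += delta; delta may be negative."""
        for row in range(len(self.table)):
            bucket, sign = self._bucket_sign(row, i)
            self.table[row][bucket] += sign * delta

    def estimate(self, i):
        """Estimate x[i] as the median of the signed counters over all rows."""
        values = []
        for row in range(len(self.table)):
            bucket, sign = self._bucket_sign(row, i)
            values.append(sign * self.table[row][bucket])
        return statistics.median(values)
```

    An estimate of $\|x\|_{\infty}$ could then be obtained by taking the maximum of `abs(sketch.estimate(i))` over the coordinates of interest; the space and update-time bounds quoted in the abstract refer to the paper's construction, not to this sketch.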

    Outside the Museum - Inside the Walls (Udenfor museet - indenfor murene)

    The article presents an action-research-inspired outreach project (Rødder) at Storstrøm Prison which, through a series of trial actions, experiments with communicating cultural history to the prison's inmates. Based on five qualitative interviews with inmates, it discusses the museum's role as an agent of personal and civic formation, and how knowledge of the past can become a "common third" in young adults' conversations about identity and belonging.

    Triangle Counting in Dynamic Graph Streams

    Estimating the number of triangles in graph streams using a limited amount of memory has become a popular topic in the last decade. Different variations of the problem have been studied, depending on whether the graph edges are provided in an arbitrary order or as incidence lists. However, with a few exceptions, the algorithms have considered insert-only streams. We present a new algorithm estimating the number of triangles in dynamic graph streams where edges can be both inserted and deleted. We show that our algorithm achieves better time and space complexity than previous solutions for various graph classes, for example sparse graphs with a relatively small number of triangles. Also, for graphs with constant transitivity coefficient, a common situation in real graphs, this is the first algorithm achieving constant processing time per edge. The result is achieved by a novel approach combining sampling of vertex triples and sparsification of the input graph. In the course of the analysis of the algorithm we present a lower bound on the number of pairwise independent 2-paths in general graphs which might be of independent interest. At the end of the paper we discuss lower bounds on the space complexity of triangle counting algorithms that make no assumptions on the structure of the graph. Comment: New version of a SWAT 2014 paper with improved result
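
    The idea of sampling vertex triples in a stream with insertions and deletions can be illustrated with a small Python sketch. This is a simplified estimator written for illustration, not the algorithm from the paper: it omits the sparsification step, the class name `TripleSampler` is hypothetical, and it assumes a simple graph whose stream never inserts an edge that is already present or deletes one that is absent.

```python
from itertools import combinations
from math import comb


class TripleSampler:
    """Track a fixed sample of vertex triples under edge insertions/deletions."""

    def __init__(self, n, triples):
        self.n = n
        self.triples = [tuple(sorted(t)) for t in triples]
        # Map each vertex pair to the sampled triples that contain it.
        self.watch = {}
        for idx, triple in enumerate(self.triples):
            for pair in combinations(triple, 2):
                self.watch.setdefault(pair, []).append(idx)
        # Number of currently present edges inside each sampled triple.
        self.edges_seen = [0] * len(self.triples)

    def update(self, u, v, delta):
        """Process one stream item: delta=+1 inserts edge {u, v}, -1 deletes it."""
        pair = (min(u, v), max(u, v))
        for idx in self.watch.get(pair, ()):
            self.edges_seen[idx] += delta

    def estimate(self):
        """Scale the fraction of sampled triples that are triangles by C(n, 3)."""
        closed = sum(1 for count in self.edges_seen if count == 3)
        return closed * comb(self.n, 3) / len(self.triples)
```

    When the sample contains every triple, the estimate equals the exact triangle count; with a uniform random sample of triples it is an unbiased estimate whose variance depends on the sample size and on how many triangles the graph contains.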