    Minwise-Independent Permutations with Insertion and Deletion of Features

    In their seminal work, Broder et al. [BroderCFM98] introduced the minHash algorithm, which computes a low-dimensional sketch of high-dimensional binary data that closely approximates pairwise Jaccard similarity. Since its invention, minHash has been commonly used by practitioners in various big data applications. In many real-life scenarios, however, the data is dynamic and its feature set evolves over time. We consider the case in which features are dynamically inserted into and deleted from the dataset. A naive solution to this problem is to recompute minHash from scratch each time the dimension is updated. However, this is expensive because it requires generating fresh random permutations. To the best of our knowledge, there has been no systematic study of minHash in the context of dynamic insertion and deletion of features. In this work, we initiate this study and propose algorithms that make minHash sketches adaptable to the dynamic insertion and deletion of features. We give a rigorous theoretical analysis of our algorithms and complement it with extensive experiments on several real-world datasets. Empirically, we observe a significant speed-up in running time while achieving performance comparable to running minHash from scratch. Our proposal is efficient, accurate, and easy to implement in practice.
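
For orientation, the classical static minHash baseline that the abstract refers to can be sketched as follows. This is a minimal illustration and not the paper's dynamic algorithm: it draws explicit random permutations of the feature indices, which is exactly the cost the naive "recompute from scratch" approach pays again whenever the dimension changes. All function names and parameters here are hypothetical.

```python
import random


def make_permutations(dim, k, seed=0):
    """Draw k random permutations of the feature indices 0..dim-1."""
    rng = random.Random(seed)
    perms = []
    for _ in range(k):
        p = list(range(dim))
        rng.shuffle(p)
        perms.append(p)
    return perms


def minhash_sketch(features, perms):
    """Sketch a binary vector given as the non-empty set of its 1-indices.

    Each coordinate is the smallest permuted position among the set's features.
    """
    return [min(p[i] for i in features) for p in perms]


def estimate_jaccard(sketch_a, sketch_b):
    """Fraction of sketch coordinates on which the two sketches agree."""
    matches = sum(a == b for a, b in zip(sketch_a, sketch_b))
    return matches / len(sketch_a)


def exact_jaccard(a, b):
    """Exact Jaccard similarity |a & b| / |a | b| of two sets."""
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    dim, k = 1000, 500
    perms = make_permutations(dim, k, seed=1)
    a = set(range(0, 600))
    b = set(range(300, 900))
    est = estimate_jaccard(minhash_sketch(a, perms), minhash_sketch(b, perms))
    print("estimated:", est, "exact:", exact_jaccard(a, b))
```

With a uniformly random permutation, the two minima coincide with probability equal to the Jaccard similarity, so the match rate over k permutations is an unbiased estimate whose error shrinks as k grows.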

    Exploiting the Computational Power of Ternary Content Addressable Memory

    Ternary Content Addressable Memory (TCAM) is a special type of memory that can execute a certain set of operations in parallel on all of its words. Because of its power consumption and relatively small storage capacity, it has only been used in special environments. Over the past few years its cost has fallen and its storage capacity has increased significantly, and these exponential trends are continuing. Hence it can be used in more general environments for larger problems. In this research we study how to exploit its computational power to speed up fundamental problems; needless to say, we have barely scratched the surface. The main problems addressed in our research are Boolean matrix multiplication, approximate subset queries using Bloom filters, fixed-universe priority queues, and network flow classification. For Boolean matrix multiplication, our simple algorithm has a running time of O(dN^2/w), where N is the size of the square matrices, w is the number of bits in each word of TCAM, and d is the maximum number of ones in a row of one of the matrices. For the fixed-universe priority queue problem we propose two data structures: one with constant time complexity and space O((1/ε) n U^ε), and the other with linear space and amortized time complexity O((lg lg U)/(lg lg lg U)), which beats the best possible data structure in the RAM model, namely y-fast tries. Considering each word of TCAM as a Bloom filter, we modify the hash functions of the Bloom filter and propose a data structure that uses the information capacity of each TCAM word more efficiently by exploiting the co-occurrence probability of possible members. Finally, in the last chapter we propose a novel technique for network flow classification using TCAM.
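
To give a feel for where a bound of the form O(dN^2/w) can come from, here is a software analogue of word-parallel Boolean matrix multiplication. It is only an assumption-laden sketch: the abstract does not describe the TCAM algorithm itself, and this code uses Python integers as bit rows on an ordinary CPU. Each output row is the OR of at most d rows of B, and each OR over an N-bit row costs about N/w word operations. The function names are hypothetical.

```python
def to_bitrows(matrix):
    """Pack each 0/1 row into an int; bit j of the int is column j."""
    return [sum(bit << j for j, bit in enumerate(row)) for row in matrix]


def from_bitrows(rows, n):
    """Unpack ints back into lists of n bits."""
    return [[(r >> j) & 1 for j in range(n)] for r in rows]


def bool_matmul(a_rows, b_rows):
    """Boolean product C = A * B on bit-packed rows.

    Row i of C is the OR of every row k of B for which A[i][k] = 1,
    so a row of A with d ones triggers d whole-row OR operations.
    """
    result = []
    for row in a_rows:
        acc = 0
        while row:
            low = row & -row            # isolate the lowest set bit
            k = low.bit_length() - 1    # its index is a column of A
            acc |= b_rows[k]            # OR in row k of B, all columns at once
            row ^= low                  # clear that bit and continue
        result.append(acc)
    return result


if __name__ == "__main__":
    a = [[1, 0], [1, 1]]
    b = [[0, 1], [1, 0]]
    print(from_bitrows(bool_matmul(to_bitrows(a), to_bitrows(b)), 2))
```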

    On Restricted Min-Wise Independence of Permutations

    A family of permutations F ⊆ S_n with a probability distribution on it is called k-restricted min-wise independent if Pr[min π(X) = π(x)] = 1/|X| for every subset X ⊆ [n] with |X| ≤ k, every x ∈ X, and π ∈ F chosen at random. We present a simple proof of a result of Norin that bounds the size of every such family from below. Some features of our method might be of independent interest.
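
The defining condition of k-restricted min-wise independence, Pr[min π(X) = π(x)] = 1/|X| for every subset X with |X| ≤ k and every x in X, can be checked by brute force for tiny n. The sketch below is illustrative only: it assumes the uniform distribution on the family (the definition allows any distribution), and the function name is hypothetical.

```python
from fractions import Fraction
from itertools import combinations, permutations


def is_k_restricted_minwise(family, n, k):
    """Check the condition for a family of permutations of range(n).

    Each permutation is a tuple where perm[i] is the image of i.
    The family is assumed to carry the uniform distribution.
    """
    total = len(family)
    for size in range(1, k + 1):
        for subset in combinations(range(n), size):
            for x in subset:
                # Count permutations in which x attains the minimum image on the subset.
                hits = sum(
                    1 for perm in family
                    if min(perm[i] for i in subset) == perm[x]
                )
                if Fraction(hits, total) != Fraction(1, size):
                    return False
    return True


if __name__ == "__main__":
    n = 4
    full = list(permutations(range(n)))      # the whole symmetric group S_4
    print(is_k_restricted_minwise(full, n, n))
    identity_only = [tuple(range(n))]
    print(is_k_restricted_minwise(identity_only, n, 2))
```

The full symmetric group satisfies the condition for every k, while a single permutation already fails for k = 2, since one element of each pair always attains the minimum.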
