2,212 research outputs found

    Pb-Hash: Partitioned b-bit Hashing

    Full text link
    Many hashing algorithms including minwise hashing (MinHash), one permutation hashing (OPH), and consistent weighted sampling (CWS) generate integers of BB bits. With kk hashes for each data vector, the storage would be B×kB\times k bits; and when used for large-scale learning, the model size would be 2B×k2^B\times k, which can be expensive. A standard strategy is to use only the lowest bb bits out of the BB bits and somewhat increase kk, the number of hashes. In this study, we propose to re-use the hashes by partitioning the BB bits into mm chunks, e.g., b×m=Bb\times m =B. Correspondingly, the model size becomes m×2b×km\times 2^b \times k, which can be substantially smaller than the original 2B×k2^B\times k. Our theoretical analysis reveals that by partitioning the hash values into mm chunks, the accuracy would drop. In other words, using mm chunks of B/mB/m bits would not be as accurate as directly using BB bits. This is due to the correlation from re-using the same hash. On the other hand, our analysis also shows that the accuracy would not drop much for (e.g.,) m=2∼4m=2\sim 4. In some regions, Pb-Hash still works well even for mm much larger than 4. We expect Pb-Hash would be a good addition to the family of hashing methods/applications and benefit industrial practitioners. We verify the effectiveness of Pb-Hash in machine learning tasks, for linear SVM models as well as deep learning models. Since the hashed data are essentially categorical (ID) features, we follow the standard practice of using embedding tables for each hash. With Pb-Hash, we need to design an effective strategy to combine mm embeddings. Our study provides an empirical evaluation on four pooling schemes: concatenation, max pooling, mean pooling, and product pooling. There is no definite answer which pooling would be always better and we leave that for future study

    Building K-Anonymous User Cohorts with\\ Consecutive Consistent Weighted Sampling (CCWS)

    Full text link
    To retrieve personalized campaigns and creatives while protecting user privacy, digital advertising is shifting from member-based identity to cohort-based identity. Under such identity regime, an accurate and efficient cohort building algorithm is desired to group users with similar characteristics. In this paper, we propose a scalable KK-anonymous cohort building algorithm called {\em consecutive consistent weighted sampling} (CCWS). The proposed method combines the spirit of the (pp-powered) consistent weighted sampling and hierarchical clustering, so that the KK-anonymity is ensured by enforcing a lower bound on the size of cohorts. Evaluations on a LinkedIn dataset consisting of >70>70M users and ads campaigns demonstrate that CCWS achieves substantial improvements over several hashing-based methods including sign random projections (SignRP), minwise hashing (MinHash), as well as the vanilla CWS
    • …
    corecore