
16 research outputs found

b-Bit Minwise Hashing

Author: Konig Arnd Christian
Li Ping
Publication date: 17/10/2009

This paper establishes the theoretical framework of b-bit minwise hashing. The original minwise hashing method has become a standard technique for estimating set similarity (e.g., resemblance) with applications in information retrieval, data management, social networks and computational advertising. By only storing the lowest b bits of each (minwise) hashed value (e.g., b=1 or 2), one can gain substantial advantages in terms of computational efficiency and storage space. We prove the basic theoretical results and provide an unbiased estimator of the resemblance for any b. We demonstrate that, even in the least favorable scenario, using b=1 may reduce the storage space at least by a factor of 21.3 (or 10.7) compared to using b=64 (or b=32), if one is interested in resemblance > 0.5.
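As a rough illustration of the idea in this abstract, the sketch below keeps only the lowest b bits of each minwise hash and corrects for accidental b-bit collisions. It is not the paper's exact estimator: it assumes both sets are small relative to the universe, so that two unrelated b-bit values collide with probability of about 2^-b. The function names and the use of salted BLAKE2b in place of random permutations are choices made here for illustration only.

```python
import hashlib


def _hash64(seed: int, item: str) -> int:
    """64-bit hash of `item`; each seed stands in for one random permutation."""
    digest = hashlib.blake2b(
        item.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def b_bit_signature(items, k: int, b: int):
    """For each of k hash functions, keep the lowest b bits of the minimum hash."""
    mask = (1 << b) - 1
    return [min(_hash64(seed, x) for x in items) & mask for seed in range(k)]


def estimate_resemblance(sig1, sig2, b: int) -> float:
    """Simplified estimator (assumption: sparse sets, chance collisions ~ 2^-b)."""
    matches = sum(u == v for u, v in zip(sig1, sig2))
    p_hat = matches / len(sig1)
    chance = 1.0 / (1 << b)
    # Remove the share of matches expected by chance, then rescale to [.., 1].
    return (p_hat - chance) / (1.0 - chance)


if __name__ == "__main__":
    set_a = [f"x{i}" for i in range(150)]
    set_b = [f"x{i}" for i in range(50, 200)]  # true resemblance is 100/200 = 0.5
    sig_a = b_bit_signature(set_a, k=400, b=2)
    sig_b = b_bit_signature(set_b, k=400, b=2)
    print(estimate_resemblance(sig_a, sig_b, b=2))
```

Each signature here costs k × b bits instead of k × 64, which is the storage trade-off the abstract refers to; smaller b generally needs a larger k for the same accuracy.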

arXiv.org e-Print Archive

CiteSeerX

Improving the Performance of SQL Join Operation in the Distributed Enterprise Information System by Caching

Author: Qu Yanzhen
Yang Weiwen
Publication venue: AIS Electronic Library (AISeL)
Publication date: 01/05/2012

The enterprise information system (EIS) contains databases and other data sources in multiple data centers. Users query the EIS via clients. The client has a working space in the cloud. Caching data in the client space will reduce the total execution time of the query. However, the client space has limited resources to store data. There are two options for caching data in the client space: caching the final results of query operations, or caching the source data tables. The problem is that some query operations, such as “joining multiple big tables”, will in some cases produce a result too big to store in the cache. By contrast, caching source data tables may be a better choice in those situations. This paper presents an algorithm that combines active caching and passive caching to improve the cache hit rate, thus improving the performance of SQL join queries in the cloud computing environment.

AIS Electronic Library (AISeL)

Pb-Hash: Partitioned b-bit Hashing

Author: Li Ping
Zhao Weijie
Publication date: 28/06/2023

Many hashing algorithms including minwise hashing (MinHash), one permutation hashing (OPH), and consistent weighted sampling (CWS) generate integers of $B$ bits. With $k$ hashes for each data vector, the storage would be $B \times k$ bits; and when used for large-scale learning, the model size would be $2^B \times k$, which can be expensive. A standard strategy is to use only the lowest $b$ bits out of the $B$ bits and somewhat increase $k$, the number of hashes. In this study, we propose to re-use the hashes by partitioning the $B$ bits into $m$ chunks, e.g., $b \times m = B$. Correspondingly, the model size becomes $m \times 2^b \times k$, which can be substantially smaller than the original $2^B \times k$. Our theoretical analysis reveals that by partitioning the hash values into $m$ chunks, the accuracy would drop. In other words, using $m$ chunks of $B/m$ bits would not be as accurate as directly using $B$ bits. This is due to the correlation from re-using the same hash. On the other hand, our analysis also shows that the accuracy would not drop much for (e.g.,) $m = 2 \sim 4$. In some regions, Pb-Hash still works well even for $m$ much larger than 4. We expect Pb-Hash would be a good addition to the family of hashing methods/applications and benefit industrial practitioners. We verify the effectiveness of Pb-Hash in machine learning tasks, for linear SVM models as well as deep learning models. Since the hashed data are essentially categorical (ID) features, we follow the standard practice of using embedding tables for each hash. With Pb-Hash, we need to design an effective strategy to combine $m$ embeddings. Our study provides an empirical evaluation on four pooling schemes: concatenation, max pooling, mean pooling, and product pooling. There is no definite answer which pooling would be always better and we leave that for future study.
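The partitioning and the model-size arithmetic described in this abstract can be sketched as follows. This is a minimal illustration, assuming B is divisible by m and that chunks are taken from the lowest bits upward; the helper names `split_hash` and `model_sizes` are hypothetical and do not come from the paper.

```python
def split_hash(h: int, B: int, m: int):
    """Split a B-bit hash value into m chunks of b = B/m bits, lowest bits first."""
    if B % m != 0:
        raise ValueError("this sketch assumes B is divisible by m")
    b = B // m
    mask = (1 << b) - 1
    return [(h >> (i * b)) & mask for i in range(m)]


def model_sizes(B: int, m: int, k: int):
    """Return (original size 2^B * k, partitioned size m * 2^b * k) with b = B/m."""
    b = B // m
    return (2 ** B) * k, m * (2 ** b) * k


if __name__ == "__main__":
    # A 16-bit hash split into four 4-bit chunks.
    print([hex(c) for c in split_hash(0xABCD, B=16, m=4)])
    # Model size before and after partitioning 16 bits into 2 chunks, k = 100.
    print(model_sizes(B=16, m=2, k=100))
```

With B = 16, m = 2 and k = 100, the table size falls from 2^16 × 100 to 2 × 2^8 × 100 entries. Each chunk would index its own embedding table, and the m embeddings would then be combined by one of the pooling schemes the abstract lists.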

arXiv.org e-Print Archive

Similarity Caching: Theory and Algorithms

Author: Garetto Michele
Leonardi Emilio
Neglia Giovanni
Publication venue: IEEE
Publication date: 01/01/2020

This paper focuses on similarity caching systems, in which a user request for an object o that is not in the cache can be (partially) satisfied by a similar stored object o′, at the cost of a loss of user utility. Similarity caching systems can be effectively employed in several application areas, like multimedia retrieval, recommender systems, genome study, and machine learning training/serving. However, despite their relevance, the behavior of such systems is far from being well understood. In this paper, we provide a first comprehensive analysis of similarity caching in the offline, adversarial, and stochastic settings. We show that similarity caching raises significant new challenges, for which we propose the first dynamic policies with some optimality guarantees. We evaluate the performance of our schemes under both synthetic and real request traces.
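To make the notion of an approximate hit concrete, here is a small sketch of a similarity cache. It is not one of the policies proposed in the paper: it assumes objects are points compared by Euclidean distance, serves the nearest stored object when it lies within a fixed threshold, and otherwise inserts the requested object with plain LRU eviction. The class name, the threshold parameter, and the return format are all illustrative assumptions.

```python
import math
from collections import OrderedDict


class SimilarityLRUCache:
    """Toy similarity cache: approximate hits within `threshold`, LRU eviction."""

    def __init__(self, capacity: int, threshold: float):
        self.capacity = capacity
        self.threshold = threshold
        self._store = OrderedDict()  # stored point -> None, oldest first

    def request(self, point):
        """Return (served_object, approximation_cost, was_hit)."""
        best, best_dist = None, math.inf
        for stored in self._store:
            dist = math.dist(stored, point)
            if dist < best_dist:
                best, best_dist = stored, dist
        if best is not None and best_dist <= self.threshold:
            self._store.move_to_end(best)  # refresh recency of the served object
            return best, best_dist, True
        # Miss: fetch the exact object, store it, evict the least recently used.
        self._store[point] = None
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)
        return point, 0.0, False


if __name__ == "__main__":
    cache = SimilarityLRUCache(capacity=2, threshold=1.0)
    print(cache.request((0.0, 0.0)))  # first request, nothing stored yet
    print(cache.request((0.5, 0.0)))  # close enough to be served approximately
```

The approximation cost returned on a hit plays the role of the loss of user utility mentioned in the abstract; a miss is treated here as cost-free apart from the fetch, which is a simplification.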

arXiv.org e-Print Archive

Crossref

INRIA a CCSD electronic archive server

PORTO@iris (Publications Open Repository TOrino - Politecnico di Torino)

Institutional Research Information System University of Turin

Cache-based query processing for search engines

Author: Cambazoglu B. B.
Ozcan R.
Sengor Altingovde I.
Ulusoy O.
Publication venue: Association for Computing Machinery (ACM)
Publication date: 01/11/2012

In practice, a search engine may fail to serve a query for various reasons, such as hardware/network failures, excessive query load, lack of matching documents, or service contract limitations (e.g., the query rate limits for third-party users of a search service). In such scenarios, where the backend search system is unable to generate answers to queries, approximate answers can be generated by exploiting the previously computed query results available in the result cache of the search engine. In this work, we propose two alternative strategies to implement this cache-based query processing idea. The first strategy aggregates the results of similar queries that were previously cached in order to create synthetic results for new queries. The second strategy forms an inverted index over the textual information (i.e., query terms and result snippets) present in the result cache and uses this index to answer new queries. Both approaches achieve reasonable result quality compared to processing queries with an inverted index built on the collection. © 2012 ACM
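The first strategy, aggregating the cached results of similar queries, might look roughly like the sketch below. The abstract does not specify the similarity measure or the scoring rule, so everything here is an assumption made for illustration: query similarity is the Jaccard overlap of query terms, each cached document is credited with similarity divided by its rank, and `synthetic_results` with its `min_sim` cutoff is a hypothetical helper.

```python
def jaccard(terms_a, terms_b) -> float:
    """Jaccard overlap between two collections of terms."""
    a, b = set(terms_a), set(terms_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def synthetic_results(query: str, result_cache: dict, top_n: int = 3,
                      min_sim: float = 0.2):
    """Build an approximate result list for `query` from cached similar queries.

    `result_cache` maps a cached query string to its ranked list of document ids.
    """
    query_terms = query.lower().split()
    scores = {}
    for cached_query, docs in result_cache.items():
        sim = jaccard(query_terms, cached_query.lower().split())
        if sim < min_sim:
            continue  # ignore cached queries that share too few terms
        for rank, doc in enumerate(docs, start=1):
            # Assumed scoring: similar queries and top ranks contribute more.
            scores[doc] = scores.get(doc, 0.0) + sim / rank
    ranked = sorted(scores, key=lambda d: (-scores[d], d))
    return ranked[:top_n]


if __name__ == "__main__":
    cache = {
        "cheap flights paris": ["d1", "d2"],
        "paris hotels": ["d3", "d1"],
        "python tutorial": ["d9"],
    }
    print(synthetic_results("cheap paris hotels", cache))
```

In this example the unrelated cached query contributes nothing, while a document returned by both related queries accumulates score from each of them.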

Bilkent University Institutional Repository

OpenMETU (Middle East Technical University)