16 research outputs found
b-Bit Minwise Hashing
This paper establishes the theoretical framework of b-bit minwise hashing.
The original minwise hashing method has become a standard technique for
estimating set similarity (e.g., resemblance) with applications in information
retrieval, data management, social networks and computational advertising.
By only storing the lowest bits of each (minwise) hashed value (e.g., b=1
or 2), one can gain substantial advantages in terms of computational efficiency
and storage space. We prove the basic theoretical results and provide an
unbiased estimator of the resemblance for any b. We demonstrate that, even in
the least favorable scenario, using b=1 may reduce the storage space at least
by a factor of 21.3 (or 10.7) compared to using b=64 (or b=32), if one is
interested in resemblance > 0.5
Improving the Performance of SQL Join Operation in the Distributed Enterprise Information System by Caching
The enterprise information system (EIS) contains databases and other data sources in multiple data centers. Users query the EIS via clients. The client has a working space in the cloud. Caching data in client space will reduce the total execution time of the query. However, the client space has limited resources to store data. There are two options for caching data at the client space: caching the final results of query operations, or caching the source data tables. The problem is that some query operations such as “joining multiple big tables” will simply produce a result too big to store in cache in some cases. By contrast, caching source data tables may be a better choice in those situations. This paper presents an algorithm that combines active caching and passive caching to improve the cache hit, thus improving performance of the SQL join query in the cloud computing environment
Pb-Hash: Partitioned b-bit Hashing
Many hashing algorithms including minwise hashing (MinHash), one permutation
hashing (OPH), and consistent weighted sampling (CWS) generate integers of
bits. With hashes for each data vector, the storage would be
bits; and when used for large-scale learning, the model size would be
, which can be expensive. A standard strategy is to use only the
lowest bits out of the bits and somewhat increase , the number of
hashes. In this study, we propose to re-use the hashes by partitioning the
bits into chunks, e.g., . Correspondingly, the model size
becomes , which can be substantially smaller than the
original .
Our theoretical analysis reveals that by partitioning the hash values into
chunks, the accuracy would drop. In other words, using chunks of
bits would not be as accurate as directly using bits. This is due to the
correlation from re-using the same hash. On the other hand, our analysis also
shows that the accuracy would not drop much for (e.g.,) . In some
regions, Pb-Hash still works well even for much larger than 4. We expect
Pb-Hash would be a good addition to the family of hashing methods/applications
and benefit industrial practitioners.
We verify the effectiveness of Pb-Hash in machine learning tasks, for linear
SVM models as well as deep learning models. Since the hashed data are
essentially categorical (ID) features, we follow the standard practice of using
embedding tables for each hash. With Pb-Hash, we need to design an effective
strategy to combine embeddings. Our study provides an empirical evaluation
on four pooling schemes: concatenation, max pooling, mean pooling, and product
pooling. There is no definite answer which pooling would be always better and
we leave that for future study
Similarity Caching: Theory and Algorithms
This paper focuses on similarity caching systems, in which a user request for an object o that is not in the cache can be (partially) satisfied by a similar stored object o 0 , at the cost of a loss of user utility. Similarity caching systems can be effectively employed in several application areas, like multimedia retrieval, recommender systems, genome study, and machine learning training/serving. However, despite their relevance, the behavior of such systems is far from being well understood. In this paper, we provide a first comprehensive analysis of similarity caching in the offline, adversarial, and stochastic settings. We show that similarity caching raises significant new challenges, for which we propose the first dynamic policies with some optimality guarantees. We evaluate the performance of our schemes under both synthetic and real request traces
Cache-based query processing for search engines
Cataloged from PDF version of article.In practice, a search engine may fail to serve a query due to various reasons such as hardware/network failures, excessive query load, lack of matching documents, or service contract limitations (e.g., the query rate limits for third-party users of a search service). In this kind of scenarios, where the backend search system is unable to generate answers to queries, approximate answers can be generated by exploiting the previously computed query results available in the result cache of the search engine.In this work, we propose two alternative strategies to implement this cache-based query processing idea. The first strategy aggregates the results of similar queries that are previously cached in order to create synthetic results for new queries. The second strategy forms an inverted index over the textual information (i.e., query terms and result snippets) present in the result cache and uses this index to answer new queries. Both approaches achieve reasonable result qualities compared to processing queries with an inverted index built on the collection. © 2012 ACM