179 research outputs found

Efficient and effective KNN sequence search with approximate n-grams

Author: Arasu A.
Gravano L.
Li C.
Navarro G.
Yang Z.
Publication venue: 'VLDB Endowment'

A pivotal prefix based filtering algorithm for string similarity search

Author: Arasu A.
Gravano L.
Li C.
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2014

We study the string similarity search problem with edit-distance constraints, which, given a set of data strings and a query string, finds the strings similar to the query. Existing algorithms use a signature-based framework. They first generate signatures for each string and then prune the dissimilar strings that have no signatures in common with the query. However, existing methods involve large numbers of signatures, and many signatures are unnecessary. Reducing the number of signatures not only increases the pruning power but also decreases the filtering cost. To address this problem, we propose a novel pivotal prefix filter which significantly reduces the number of signatures. We prove that the pivotal filter achieves larger pruning power and lower filtering cost than state-of-the-art filters. We develop a dynamic programming method to select high-quality pivotal prefix signatures to prune dissimilar strings with non-consecutive errors to the query. We propose an alignment filter that considers the alignments between signatures to prune large numbers of dissimilar pairs with consecutive errors to the query. Experimental results on three real datasets show that our method achieves high performance and outperforms the state-of-the-art methods by an order of magnitude.
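
The signature-based filter-and-verify framework that the abstract refers to can be illustrated with a minimal sketch. This is not the pivotal prefix filter from the paper. It swaps in a simpler, well-known pigeonhole signature scheme: each data string is split into tau + 1 disjoint segments, and a string within edit distance tau of the query must have at least one segment occurring unchanged in the query. The function names (`edit_distance`, `segments`, `search`) and the sample strings are hypothetical and chosen only for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance via the standard row-by-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def segments(s, tau):
    """Split s into tau + 1 disjoint contiguous segments (the signatures)."""
    k = tau + 1
    base, extra = divmod(len(s), k)
    out, pos = [], 0
    for i in range(k):
        length = base + (1 if i < extra else 0)
        out.append(s[pos:pos + length])
        pos += length
    return out


def search(data, query, tau):
    """Return the strings in data within edit distance tau of query."""
    results = []
    for s in data:
        # Length filter: tau edits change the length by at most tau.
        if abs(len(s) - len(query)) > tau:
            continue
        # Signature filter: tau edits touch at most tau of the tau + 1
        # segments, so at least one segment must survive in the query.
        if not any(seg in query for seg in segments(s, tau)):
            continue
        # Verification: compute the exact distance for surviving candidates.
        if edit_distance(s, query) <= tau:
            results.append(s)
    return results


if __name__ == "__main__":
    print(search(["vldb", "pvldb", "sigmod", "icde"], "vldbj", 1))
```

The filter never discards a true answer, so the verification step only removes false positives. Fewer and more selective signatures mean fewer candidates reach the quadratic-time verification, which is the cost the abstract aims to reduce.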

CiteSeerX

Crossref

Automatic Classification of Text Databases through Query Probing

Author: D. Hawking
D. Koller
J. P. Callan
J. Xu
L. Gravano
M. Perkowitz
S. Gauch
W. Meng
W. W. Cohen
Publication date: 01/01/2000

Many text databases on the web are "hidden" behind search interfaces, and their documents are only accessible through querying. Search engines typically ignore the contents of such search-only databases. Recently, Yahoo-like directories have started to manually organize these databases into categories that users can browse to find these valuable resources. We propose a novel strategy to automate the classification of search-only text databases. Our technique starts by training a rule-based document classifier, and then uses the classifier's rules to generate probing queries. The queries are sent to the text databases, which are then classified based on the number of matches that they produce for each query. We report some initial exploratory experiments that show that our approach is promising to automatically characterize the contents of text databases accessible on the web.

Comment: 7 pages, 1 figure
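
The probing step described in this abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the rule set `RULES`, the in-memory stand-in for a search interface (`count_matches`), and the `min_share` threshold are all hypothetical. A real system would learn the rules from training documents and read match counts from each database's search results page.

```python
from collections import Counter

# Hypothetical classifier rules: each category maps to probing queries,
# and each query is a tuple of terms that must all appear in a document.
RULES = {
    "computers": [("linux",), ("cpu", "cache")],
    "health": [("diabetes",), ("blood", "pressure")],
}


def count_matches(documents, terms):
    """Stand-in for a search interface: number of documents matching all terms."""
    return sum(all(t in doc.lower().split() for t in terms) for doc in documents)


def classify(documents, rules, min_share=0.5):
    """Assign categories whose probes account for at least min_share of all matches."""
    totals = Counter()
    for category, queries in rules.items():
        for terms in queries:
            totals[category] += count_matches(documents, terms)
    grand_total = sum(totals.values())
    if grand_total == 0:
        return []  # no probe matched, so nothing can be said about this database
    return sorted(c for c, n in totals.items() if n / grand_total >= min_share)


if __name__ == "__main__":
    database = [
        "Linux kernel scheduling",
        "CPU cache misses on Linux",
        "Managing diabetes",
    ]
    print(classify(database, RULES))
```

Only match counts are used, never the documents themselves, which is what makes the idea applicable to databases that expose nothing but a query box.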

arXiv.org e-Print Archive

CiteSeerX

Crossref

Columbia University Academic Commons

Efficient Similarity Join and Search on Multi-Attribute Data

Author: Dalvi N. N.
Garey M.
Gravano L.
Li C.
Michelson M.
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 04/12/2015

In this paper we study similarity join and search on multi-attribute data. Traditional methods on single-attribute data have pruning power only on single attributes and cannot efficiently support multi-attribute data. To address this problem, we propose a prefix tree index which has holistic pruning ability on multiple attributes. We propose a cost model to quantify the prefix tree which can guide the prefix tree construction. Based on the prefix tree, we devise a filter-verification framework to support similarity search and join on multi-attribute data. The filter step prunes a large number of dissimilar results and identifies some candidates using the prefix tree, and the verification step verifies the candidates to generate the final answer. For similarity join, we prove that constructing an optimal prefix tree is NP-complete and develop a greedy algorithm to achieve high performance. For similarity search, since one prefix tree cannot support all possible search queries, we extend the cost model to support similarity search and devise a budget-based algorithm to construct multiple high-quality prefix trees. We also devise a hybrid verification algorithm to improve the verification step. Experimental results show our method significantly outperforms baseline approaches.

CiteSeerX

Crossref

GlOSS

Author: Anthony Tomasic
BOWMAN C. M.
DANZIG P. B.
FLATER D. W.
GRAVANO L.
Héctor García-Molina
KAHLE B.
Luis Gravano
NEUMAN B. C.
SCHWARTZ M. F.
SELBERG E.
SIMPSON P.
VOORHEES E. M.
YAN T. W.
Publication venue: 'Association for Computing Machinery (ACM)'

Crossref

Pigeonring: A Principle for Faster Thresholded Similarity Search

Author: Altschul S. F.
Andoni A.
Apostol T.
Arasu A.
Broder A. Z.
Christiani T.
Ciaccia P.
Daepp U.
Gionis A.
Gravano L.
Hwang Y.
Jégou H.
Kim S.
Li C.
Lv Q.
Mann W.
Meek C.
Qin J.
Razborov A. A.
Samet H.
Savasere A.
Tabei Y.
Tao T.
Wang J.
Weiss Y.
Yi B.
Publication venue: 'VLDB Endowment'
Publication date: 01/09/2018

Crossref

Edinburgh Research Explorer

Structural responses of Ipomoea nil (L.) Roth 'Scarlet O'Hara' (Convolvulaceae) exposed to ozone

Author: Bárbara Bâesso Moura
Domingos M.
Edenise Segala Alves
Epstein E.
Fernandes A.J.
Ferreira M.L.
Gravano E.
Günthardt-Goerg M.S.
Klumpp A.
Krupa S.
Kubínová L.
Lersten N.R.
Niderman T.
Nouchi I.
Novak K.
O'Brien T.P.
Rao M.V.
Reig-Armiñana J.
Sandermann H.
Schraudner M.
Souza S.R.
Sílvia Ribeiro de Souza
Vollenweider P.
Publication venue: 'FapUNIFESP (SciELO)'

Crossref

Collecting Profiling for Collection Fusion in Distributed Information Retrieval Systems

Author: J.P. Callan
L. Gravano
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2007

Discovering resource descriptions and merging results obtained from remote search engines are two key issues in distributed information retrieval studies. In uncooperative environments, query-based sampling and merging strategies based on score normalization are well-known approaches to such problems. However, these approaches consider only the content of the remote database and not its retrieval performance. In this paper, we address this problem in peer-to-peer information systems and argue that the performance of the search engine should also be considered. We also propose a collection profiling strategy which can discover not only collection content but also retrieval performance. Web-based query classification and two collection fusion approaches based on the collection profiling are also introduced in this paper. Our experiments show that our merging strategies are effective in merging results in uncooperative environments.

Crossref

Queensland University of Technology ePrints Archive

Efficient Semantically Equal Join on Strings

Author: E. Rahm
L. Gravano
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2007

Crossref

MESA

Author: Gravano L.
Publication venue: 'VLDB Endowment'

Crossref