
    Approximate String Joins in a Database (Almost) for Free -- Erratum

    In [GIJ+01a, GIJ+01b] we described how to use q-grams in an RDBMS to perform approximate string joins. We also showed how to implement the approximate join using plain SQL queries. Specifically, we described three filters, the count filter, position filter, and length filter, which can be used to execute the approximate join efficiently. The intuition behind the count filter was that strings that are similar have many q-grams in common. In particular, two strings s1 and s2 can have up to max{|s1|, |s2|} + q - 1 common q-grams. When s1 = s2, they have exactly that many q-grams in common. When s1 and s2 are within edit distance k, they share at least (max{|s1|, |s2|} + q - 1) - kq q-grams, since kq is the maximum number of q-grams that can be affected by k edit distance operations.
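
    A minimal sketch of the count filter described above, assuming the usual padding of each string with q-1 begin/end markers so that a string s yields |s| + q - 1 positional q-grams; the padding symbols and function names are illustrative and not taken from [GIJ+01a, GIJ+01b].

        from collections import Counter

        def qgrams(s, q):
            """Multiset of q-grams of s after padding with q-1 markers on each side."""
            padded = "#" * (q - 1) + s + "$" * (q - 1)
            return Counter(padded[i:i + q] for i in range(len(padded) - q + 1))

        def count_filter_passes(s1, s2, q, k):
            """Necessary condition: strings within edit distance k must share at
            least (max(|s1|, |s2|) + q - 1) - k*q q-grams."""
            common = sum((qgrams(s1, q) & qgrams(s2, q)).values())
            return common >= max(len(s1), len(s2)) + q - 1 - k * q

        # Pairs failing the bound can be pruned before an exact edit-distance check.
        print(count_filter_passes("smith", "smyth", q=2, k=1))  # True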

    Dense subgraph maintenance under streaming edge weight updates for real-time story identification

    Recent years have witnessed an unprecedented proliferation of social media. People around the globe author, every day, millions of blog posts, social network status updates, etc. This rich stream of information can be used to identify, on an ongoing basis, emerging stories and events that capture popular attention. Stories can be identified via groups of tightly coupled real-world entities, namely the people, locations, products, etc., that are involved in the story. The sheer scale and rapid evolution of the data involved necessitate highly efficient techniques for identifying important stories at every point in time. The main challenge in real-time story identification is the maintenance of dense subgraphs (corresponding to groups of tightly coupled entities) under streaming edge weight updates (resulting from a stream of user-generated content). This is the first work to study the efficient maintenance of dense subgraphs under such streaming edge weight updates. For a wide range of definitions of density, we derive theoretical results regarding the magnitude of change that a single edge weight update can cause. Based on these, we propose a novel algorithm, DynDens, which outperforms adaptations of existing techniques to this setting and yields meaningful, intuitive results. Our approach is validated by a thorough experimental evaluation on large-scale real and synthetic datasets.
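
    The core bookkeeping problem the abstract refers to can be illustrated with a small sketch: keep the density of a tracked vertex set current as edge weights change in the stream. The density used here (total edge weight divided by number of vertices) is just one common choice, and the class below is not DynDens itself; the paper's algorithm and theoretical bounds are not reproduced.

        class SubgraphDensityTracker:
            def __init__(self, vertices):
                self.vertices = set(vertices)   # the tracked entity group
                self.weights = {}               # (u, v) -> current edge weight
                self.total_weight = 0.0

            def update_edge(self, u, v, new_weight):
                """Apply one streaming edge weight update in O(1)."""
                key = (min(u, v), max(u, v))
                old = self.weights.get(key, 0.0)
                if u in self.vertices and v in self.vertices:
                    self.total_weight += new_weight - old
                self.weights[key] = new_weight

            def density(self):
                return self.total_weight / max(len(self.vertices), 1)

        tracker = SubgraphDensityTracker({"entity_a", "entity_b", "entity_c"})
        tracker.update_edge("entity_a", "entity_b", 3.0)  # e.g., a co-mention count
        tracker.update_edge("entity_b", "entity_c", 1.5)
        print(tracker.density())  # 1.5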

    On Effective Multi-Dimensional Indexing for Strings

    As databases have expanded in scope from storing purely business data to include XML documents, product catalogs, e-mail messages, and directory data, it has become increasingly important to search databases based on wild-card string matching: prefix matching, for example, is more common (and useful) than exact matching for such data. In many cases, matches need to be on multiple attributes/dimensions, with correlations between the dimensions. Traditional multi-dimensional index structures, designed with (fixed length) numeric data in mind, are not suitable for matching unbounded length string data. In this paper, we describe a general technique for adapting a multi-dimensional index structure for wild-card indexing of unbounded length string data. The key ideas are (a) a carefully developed mapping function from strings to rational numbers, (b) representing an unbounded length string in an index leaf page by a fixed length offset to an external key, and (c) storing multiple elided t..
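
    As a hedged illustration of key idea (a), the sketch below shows one simple order-preserving map from strings over a lowercase alphabet to rational numbers, so that a prefix query becomes a numeric range query; the paper's actual mapping function and elision scheme are not reproduced here.

        from fractions import Fraction

        ALPHABET = "abcdefghijklmnopqrstuvwxyz"
        BASE = len(ALPHABET) + 1  # digit 0 is reserved so a string sorts before its extensions

        def string_to_rational(s):
            """Map s to a rational in [0, 1) preserving lexicographic order."""
            value, scale = Fraction(0), Fraction(1, BASE)
            for ch in s:
                value += (ALPHABET.index(ch) + 1) * scale
                scale /= BASE
            return value

        # A prefix query on "data*" becomes the numeric range
        # [string_to_rational("data"), string_to_rational("datb")).
        assert string_to_rational("data") < string_to_rational("database") < string_to_rational("datb")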

    Efficient k-Nearest Neighbor Searches for Parallel Multidimensional Index Structures


    Choosing Bucket Boundaries for Histograms

    Histograms have long been used to capture attribute value distribution statistics for query optimizers. More recently, there has been a growing interest in the use of histograms to produce quick approximate answers to decision support queries. This motivates finding good strategies for specifying histogram buckets. Under the assumption that finding optimal bucket boundaries is computationally inefficient, previous research has focused on finding heuristics that produce good solutions. In this paper, we present an algorithm to determine bucket boundaries optimally, in time proportional to the square of the number of distinct data values, for a broad class of optimality metrics. Through experimentation, we show that optimal histograms can have substantially lower reconstruction error than histograms produced according to popular heuristics. We also present a new heuristic, based on our understanding of the optimal solution, which in many cases obtains lower reconstruction error than prev..
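
    A hedged sketch of the kind of quadratic dynamic program such an algorithm can use, with the sum of squared errors within each bucket as one example metric from the broader class; the paper's exact metrics and optimizations are not reproduced here.

        import itertools

        def optimal_buckets(values, num_buckets):
            """Bucket end indices minimizing total squared error; values must be sorted."""
            n = len(values)
            prefix = list(itertools.accumulate(values, initial=0))
            prefix_sq = list(itertools.accumulate((v * v for v in values), initial=0))

            def sse(i, j):  # squared error of a single bucket covering values[i:j]
                s, sq, cnt = prefix[j] - prefix[i], prefix_sq[j] - prefix_sq[i], j - i
                return sq - s * s / cnt

            INF = float("inf")
            cost = [[INF] * (n + 1) for _ in range(num_buckets + 1)]   # cost[b][j]: best error with b buckets on first j values
            split = [[0] * (n + 1) for _ in range(num_buckets + 1)]
            cost[0][0] = 0.0
            for b in range(1, num_buckets + 1):
                for j in range(b, n + 1):
                    for i in range(b - 1, j):
                        c = cost[b - 1][i] + sse(i, j)
                        if c < cost[b][j]:
                            cost[b][j], split[b][j] = c, i
            bounds, j = [], n
            for b in range(num_buckets, 0, -1):   # walk back to recover boundaries
                bounds.append(j)
                j = split[b][j]
            return sorted(bounds)

        print(optimal_buckets([1, 1, 2, 9, 10, 10, 25], 3))  # [3, 6, 7]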

    Optimal Histograms with Quality Guarantees

    Histograms are commonly used to capture attribute value distribution statistics for query optimizers. More recently, histograms have also been considered as a way to produce quick approximate answers t

    Effective Explanations for Entity Resolution Models

    Entity resolution (ER) aims at matching records that refer to the same real-world entity. Although widely studied for the last 50 years, ER still represents a challenging data management problem, and several recent works have started to investigate the opportunity of applying deep learning (DL) techniques to solve this problem. In this paper, we study the fundamental problem of explainability of the DL solution for ER. Understanding the matching predictions of an ER solution is indeed crucial to assess the trustworthiness of the DL model and to discover its biases. We treat the DL model as a black box classifier and - while previous approaches to provide explanations for DL predictions are agnostic to the classification task - we propose the CERTA approach that is aware of the semantics of the ER problem. Our approach produces both saliency explanations, which associate each attribute with a saliency score, and counterfactual explanations, which provide examples of values that can flip the prediction. CERTA builds on a probabilistic framework that computes the explanations by evaluating the outcomes produced using perturbed copies of the input records. We experimentally evaluate CERTA's explanations of state-of-the-art ER solutions based on DL models using publicly available datasets, and demonstrate the effectiveness of CERTA over recently proposed methods for this problem.
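
    As a hedged illustration of perturbation-based saliency for a black-box ER matcher (the general idea behind saliency explanations, not CERTA's actual probabilistic framework), the sketch below replaces one attribute at a time with values drawn from other records and measures how much the prediction moves; match_proba and the donor records are hypothetical stand-ins.

        import random

        def attribute_saliency(record_a, record_b, match_proba, donors, trials=20):
            """Estimate how strongly each attribute of record_a drives the prediction."""
            base = match_proba(record_a, record_b)
            saliency = {}
            for attr in record_a:
                shifts = []
                for _ in range(trials):
                    perturbed = dict(record_a)
                    perturbed[attr] = random.choice(donors)[attr]  # swap in a donor value
                    shifts.append(abs(base - match_proba(perturbed, record_b)))
                saliency[attr] = sum(shifts) / trials
            return saliency  # higher score => attribute matters more for the decision

        # Toy usage with a trivial matcher, for illustration only:
        def toy_model(a, b):
            return 1.0 if a["brand"] == b["brand"] else 0.0

        donors = [{"title": "ipad mini", "brand": "apple"},
                  {"title": "galaxy s4", "brand": "samsung"}]
        print(attribute_saliency({"title": "iphone 5", "brand": "apple"},
                                 {"title": "apple iphone 5 16gb", "brand": "apple"},
                                 toy_model, donors))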