Bridging the Semantic Gap with SQL Query Logs in Natural Language Interfaces to Databases
A critical challenge in constructing a natural language interface to database
(NLIDB) is bridging the semantic gap between a natural language query (NLQ) and
the underlying data. This challenge exhibits itself in two specific ways:
keyword mapping and join path inference. Keyword mapping is the task of
mapping individual keywords in the original NLQ to database elements (such as
relations, attributes or values). It is challenging due to the ambiguity in
mapping the user's mental model and diction to the schema definition and
contents of the underlying database. Join path inference is the process of
selecting the relations and join conditions in the FROM clause of the final SQL
query, and is difficult because NLIDB users lack knowledge of the database
schema or of SQL and therefore cannot explicitly specify the intermediate tables
and joins needed to construct a final SQL query. In this paper, we propose
leveraging information from the SQL query log of a database to enhance the
performance of existing NLIDBs with respect to these challenges. We present
Templar, a system that can be used to augment existing NLIDBs. Our extensive
experimental evaluation demonstrates the effectiveness of our approach, yielding
up to a 138% improvement in top-1 accuracy in existing NLIDBs by leveraging SQL
query log information.
Comment: Accepted to IEEE International Conference on Data Engineering (ICDE) 2019
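The abstract does not spell out Templar's scoring model, but the core idea of biasing keyword mapping with query-log statistics can be sketched. Below is a minimal, hypothetical illustration: the function `rank_mappings`, the similarity measure, and the 0.7/0.3 blend are assumptions for exposition, not Templar's actual design.

```python
from collections import Counter
from difflib import SequenceMatcher

def lexical_sim(a: str, b: str) -> float:
    """Crude lexical similarity between an NLQ keyword and a schema name."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def rank_mappings(keyword, schema_elements, query_log, w_sim=0.7):
    """Rank candidate schema elements for a single NLQ keyword.

    Blends lexical similarity with how often each element appears in the
    SQL query log, so log-popular elements win among lexically plausible
    candidates. The 0.7/0.3 blend is an arbitrary illustrative choice.
    """
    usage = Counter()
    for sql in query_log:
        for elem in schema_elements:
            if elem in sql.lower():
                usage[elem] += 1
    total = sum(usage.values()) or 1
    scored = [(e, w_sim * lexical_sim(keyword, e)
                  + (1 - w_sim) * usage[e] / total)
              for e in schema_elements]
    return sorted(scored, key=lambda s: s[1], reverse=True)

log = ["SELECT title FROM publication WHERE year > 2010",
       "SELECT name FROM author"]
print(rank_mappings("publications", ["publication", "author", "venue"], log))
```

In this toy run the log evidence reinforces the lexical match, pushing `publication` to the top of the ranking.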
Fast Data in the Era of Big Data: Twitter's Real-Time Related Query Suggestion Architecture
We present the architecture behind Twitter's real-time related query
suggestion and spelling correction service. Although these tasks have received
much attention in the web search literature, the Twitter context introduces a
real-time "twist": after significant breaking news events, we aim to provide
relevant results within minutes. This paper provides a case study illustrating
the challenges of real-time data processing in the era of "big data". We tell
the story of how our system was built twice: our first implementation was built
on a typical Hadoop-based analytics stack, but was later replaced because it
did not meet the latency requirements necessary to generate meaningful
real-time results. The second implementation, which is the system deployed in
production, is a custom in-memory processing engine specifically designed for
the task. This experience taught us that the current typical usage of Hadoop as
a "big data" platform, while great for experimentation, is not well suited to
low-latency processing, and points the way to future work on data analytics
platforms that can handle "big" as well as "fast" data.
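To make the in-memory design concrete, here is a toy sketch of the kind of time-decayed query co-occurrence store such a service might use. The session model, half-life, and class names are guesses for illustration, not Twitter's actual engine.

```python
import math
import time
from collections import defaultdict

class RelatedQueryStore:
    """Toy in-memory store of time-decayed query co-occurrence counts.

    Queries issued close together in one user session are treated as
    related; exponential decay lets suggestions track breaking events
    within minutes rather than hours.
    """

    def __init__(self, half_life_s=600.0):
        self.lam = math.log(2.0) / half_life_s
        self.count = defaultdict(float)  # (q1, q2) -> decayed count
        self.stamp = {}                  # (q1, q2) -> last update time

    def _value(self, pair, now):
        # Lazily apply the decay accumulated since the last update.
        age = now - self.stamp.get(pair, now)
        return self.count[pair] * math.exp(-self.lam * age)

    def observe(self, q1, q2, now):
        """Record that q1 and q2 co-occurred in a session at time `now`."""
        pair = tuple(sorted((q1, q2)))
        self.count[pair] = self._value(pair, now) + 1.0
        self.stamp[pair] = now

    def suggest(self, query, k=3, now=None):
        """Top-k related queries by current (decayed) co-occurrence."""
        now = time.time() if now is None else now
        scores = [(a if b == query else b, self._value((a, b), now))
                  for (a, b) in self.count if query in (a, b)]
        return sorted(scores, key=lambda s: s[1], reverse=True)[:k]
```

Lazy decay on read and write keeps updates O(1) per observation, which is the sort of property a Hadoop batch pipeline cannot offer at minute-level latency.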
Multi-View Active Learning in the Non-Realizable Case
The sample complexity of active learning under the realizability assumption
has been well-studied. The realizability assumption, however, rarely holds in
practice. In this paper, we theoretically characterize the sample complexity of
active learning in the non-realizable case under the multi-view setting. We
prove that, with unbounded Tsybakov noise, the sample complexity of multi-view
active learning can be $\widetilde{O}(\log\frac{1}{\varepsilon})$, in contrast
to the single-view setting, where polynomial improvement is the best possible
achievement. We also prove that in the general multi-view setting the sample
complexity of active learning with unbounded Tsybakov noise is
$\widetilde{O}(\frac{1}{\varepsilon})$, where the order of
$\frac{1}{\varepsilon}$ is independent of the parameter in Tsybakov noise, in
contrast to previous polynomial bounds where the order of
$\frac{1}{\varepsilon}$ is related to the parameter in Tsybakov noise.
Comment: 22 pages, 1 figure
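The paper is about sample-complexity bounds rather than a concrete algorithm, but the multi-view mechanism behind such bounds, namely querying labels only where the views disagree, is easy to sketch. The co-testing-style loop below is an illustrative stand-in, not the paper's construction; the function name and the use of logistic regression are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def multiview_active_learn(X1, X2, oracle, seed_idx, rounds=10, batch=5):
    """Disagreement-based multi-view active learning (illustrative only).

    X1, X2   : two views (feature matrices) of the same unlabeled pool.
    oracle   : callable index -> true label; each call is one label query.
    seed_idx : initial labeled indices (must cover both classes).

    Each round fits one classifier per view and queries labels only for
    "contention points" where the two views disagree -- the points that
    drive the label savings in the multi-view setting.
    """
    labeled = {i: oracle(i) for i in seed_idx}
    h1 = h2 = None
    for _ in range(rounds):
        idx = np.array(sorted(labeled))
        y = np.array([labeled[i] for i in idx])
        h1 = LogisticRegression().fit(X1[idx], y)
        h2 = LogisticRegression().fit(X2[idx], y)
        p1, p2 = h1.predict(X1), h2.predict(X2)
        disagree = [i for i in range(len(X1))
                    if i not in labeled and p1[i] != p2[i]]
        if not disagree:
            break  # views agree everywhere: stop querying
        for i in disagree[:batch]:
            labeled[i] = oracle(i)
    return h1, h2, labeled
```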
Fast Exact Search in Hamming Space with Multi-Index Hashing
There is growing interest in representing image data and feature descriptors
using compact binary codes for fast near neighbor search. Although binary codes
are motivated by their use as direct indices (addresses) into a hash table,
codes longer than 32 bits are not being used as such, as it was thought to be
ineffective. We introduce a rigorous way to build multiple hash tables on
binary code substrings that enables exact k-nearest neighbor search in Hamming
space. The approach is storage efficient and straightforward to implement.
Theoretical analysis shows that the algorithm exhibits sub-linear run-time
behavior for uniformly distributed codes. Empirical results show dramatic
speedups over a linear scan baseline for datasets of up to one billion codes of
64, 128, or 256 bits.
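The substring construction is concrete enough to sketch. Below is a minimal version of the multi-index idea for exact r-neighbor search over integer-encoded codes, assuming the substring count divides the code length; the paper's full method additionally grows r to answer k-nearest-neighbor queries and tunes the number of substrings, which this sketch omits.

```python
from collections import defaultdict
from itertools import combinations

def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

class MultiIndexHash:
    """Exact r-neighbor search in Hamming space via substring hash tables.

    Split each b-bit code into m disjoint substrings and index each
    substring in its own hash table. By the pigeonhole principle, any
    code within Hamming distance r of the query matches at least one
    query substring within distance floor(r / m), so it suffices to
    probe each table with nearby substring values, then verify the
    candidates against the full code.
    """

    def __init__(self, codes, b: int, m: int):
        # Assumes m divides b; self.s is the substring width in bits.
        self.codes, self.m, self.s = codes, m, b // m
        self.tables = [defaultdict(list) for _ in range(m)]
        for i, c in enumerate(codes):
            for j in range(m):
                self.tables[j][self._sub(c, j)].append(i)

    def _sub(self, code: int, j: int) -> int:
        return (code >> (j * self.s)) & ((1 << self.s) - 1)

    def _ball(self, value: int, radius: int):
        # Enumerate all s-bit values within `radius` bit flips of value.
        for d in range(radius + 1):
            for bits in combinations(range(self.s), d):
                flip = 0
                for pos in bits:
                    flip |= 1 << pos
                yield value ^ flip

    def search(self, query: int, r: int):
        cands = set()
        for j in range(self.m):
            for v in self._ball(self._sub(query, j), r // self.m):
                cands.update(self.tables[j].get(v, ()))
        return [i for i in cands if hamming(self.codes[i], query) <= r]
```

The sub-linear behavior comes from probing each table within radius only r/m instead of r, which shrinks the enumeration from a ball of radius r in b bits to m balls of radius r/m in b/m bits.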
Down the Rabbit Hole: Robust Proximity Search and Density Estimation in Sublinear Space
For a set of $n$ points in $\mathbb{R}^d$, and parameters $k$ and
$\varepsilon$, we present a data structure that answers
$(1+\varepsilon, k)$-ANN queries in logarithmic time. Surprisingly, the space
used by the data structure is $\widetilde{O}(n/k)$; that is, the space used is
sublinear in the input size if $k$ is sufficiently large. Our approach provides
a novel way to summarize geometric data, such that meaningful proximity queries
on the data can be carried out using this sketch. Using this, we provide a
sublinear space data structure that can estimate the density of a point set
under various measures, including: (i) the sum of distances of the $k$ closest
points to the query point, and (ii) the sum of squared distances of the $k$
closest points to the query point. Our approach generalizes to other
distance-based density estimates of similar flavor. We also study the problem
of approximating some of these quantities when using sampling. In particular,
we show that a sample of size $\widetilde{O}(n/k)$ is sufficient, in some
restricted cases, to estimate the above quantities. Remarkably, the sample size
has only linear dependency on the dimension.
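As a concrete reference for the two quantities being estimated, here is a brute-force computation of measures (i) and (ii); the paper's contribution is approximating these from a sublinear-space sketch or a small sample, which this exact baseline makes no attempt at.

```python
import numpy as np

def density_measures(P: np.ndarray, q: np.ndarray, k: int):
    """Exact baseline for the two density measures above:
    (i)  sum of distances of the k points of P closest to query q, and
    (ii) sum of squared distances of those same k points.
    Runs in O(n d) time and O(n) space, unlike the paper's sketch."""
    d = np.linalg.norm(P - q, axis=1)
    knn = np.partition(d, k - 1)[:k]  # k smallest distances, unordered
    return knn.sum(), (knn ** 2).sum()
```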
Spectral Approaches to Nearest Neighbor Search
We study spectral algorithms for the high-dimensional Nearest Neighbor Search
problem (NNS). In particular, we consider a semi-random setting where a dataset
$P$ in $\mathbb{R}^d$ is chosen arbitrarily from an unknown subspace of low
dimension $k \ll d$, and then perturbed by fully $d$-dimensional Gaussian
noise. We design spectral NNS algorithms whose query time depends polynomially
on $d$ and $\log n$ (where $n = |P|$) for large ranges of $k$, $d$ and $n$. Our
algorithms use a repeated computation of the top PCA vector/subspace, and are
effective even when the random-noise magnitude is much larger than the
interpoint distances in $P$. Our motivation is that in practice, a number of
spectral NNS algorithms outperform the random-projection methods that seem
otherwise theoretically optimal on worst-case datasets. In this paper we aim to
provide theoretical justification for this disparity.
Comment: Accepted in the proceedings of FOCS 2014. 30 pages and 4 figures
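To make "repeated computation of the top PCA vector/subspace" concrete, here is a single round of the idea: project the noisy data onto its top-$k$ PCA subspace, where the low-dimensional signal dominates the noise, and search there. The paper's actual algorithms iterate this computation and interleave it with pruning; this one-shot sketch is only illustrative.

```python
import numpy as np

def pca_project_nn(P: np.ndarray, q: np.ndarray, k: int) -> int:
    """One round of spectral NNS: project onto the top-k PCA subspace
    of the dataset, then brute-force search in the projected space.
    Returns the index of the (approximate) nearest neighbor of q."""
    mu = P.mean(axis=0)
    # Top-k right singular vectors of the centered data span the
    # top-k PCA subspace.
    _, _, Vt = np.linalg.svd(P - mu, full_matrices=False)
    V = Vt[:k].T                       # d x k projection basis
    Pk, qk = (P - mu) @ V, (q - mu) @ V
    return int(np.argmin(np.linalg.norm(Pk - qk, axis=1)))
```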