8,998 research outputs found

    Toward Attribute Efficient Learning Algorithms

    We make progress on two important problems regarding attribute efficient learnability. First, we give an algorithm for learning decision lists of length $k$ over $n$ variables using $2^{\tilde{O}(k^{1/3})} \log n$ examples and time $n^{\tilde{O}(k^{1/3})}$. This is the first algorithm for learning decision lists that has both subexponential sample complexity and subexponential running time in the relevant parameters. Our approach establishes a relationship between attribute efficient learning and polynomial threshold functions and is based on a new construction of low degree, low weight polynomial threshold functions for decision lists. For a wide range of parameters our construction matches a 1994 lower bound due to Beigel for the ODDMAXBIT predicate and gives an essentially optimal tradeoff between polynomial threshold function degree and weight. Second, we give an algorithm for learning an unknown parity function on $k$ out of $n$ variables using $O(n^{1-1/k})$ examples in time polynomial in $n$. For $k = o(\log n)$ this yields a polynomial time algorithm with sample complexity $o(n)$. This is the first polynomial time algorithm for learning parity on a superconstant number of variables with sublinear sample complexity.
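
    As a point of reference for what "learning a parity" asks for: the target is a hidden subset of the $n$ variables, and each label is the XOR of an example's bits on that subset. The sketch below is not the paper's attribute-efficient algorithm (which needs only $O(n^{1-1/k})$ examples); it is the standard noiseless baseline that recovers the hidden parity from roughly $n$ examples by Gaussian elimination over GF(2). All names are illustrative.

        import random

        def learn_parity_gf2(examples):
            """Recover a hidden parity vector s from (x, label) pairs with label = <s, x> mod 2.

            Solves the linear system over GF(2) by Gaussian elimination and returns one
            consistent solution (free variables are set to 0).
            """
            n = len(examples[0][0])
            rows = [list(x) + [y] for x, y in examples]   # augmented matrix over GF(2)
            pivot_cols, r = [], 0
            for c in range(n):
                pivot = next((i for i in range(r, len(rows)) if rows[i][c] == 1), None)
                if pivot is None:
                    continue
                rows[r], rows[pivot] = rows[pivot], rows[r]
                for i in range(len(rows)):
                    if i != r and rows[i][c] == 1:
                        rows[i] = [a ^ b for a, b in zip(rows[i], rows[r])]
                pivot_cols.append(c)
                r += 1
            s = [0] * n
            for i, c in enumerate(pivot_cols):
                s[c] = rows[i][n]
            return s

        if __name__ == "__main__":
            random.seed(0)
            n, k = 20, 3
            support = set(random.sample(range(n), k))
            secret = [1 if i in support else 0 for i in range(n)]
            examples = []
            for _ in range(5 * n):
                x = [random.randint(0, 1) for _ in range(n)]
                examples.append((x, sum(a * b for a, b in zip(secret, x)) % 2))
            print(learn_parity_gf2(examples) == secret)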

    On learning k-parities with and without noise

    We first consider the problem of learning $k$-parities in the on-line mistake-bound model: given a hidden vector $x \in \{0,1\}^n$ with $|x| = k$ and a sequence of "questions" $a_1, a_2, \ldots \in \{0,1\}^n$, where the algorithm must reply to each question with $\langle a_i, x \rangle \pmod 2$, what is the best tradeoff between the number of mistakes made by the algorithm and its time complexity? We improve the previous best result of Buhrman et al. by an $\exp(k)$ factor in the time complexity. Second, we consider the problem of learning $k$-parities in the presence of classification noise of rate $\eta \in (0,1/2)$. A polynomial time algorithm for this problem (when $\eta > 0$ and $k = \omega(1)$) is a longstanding challenge in learning theory. Grigorescu et al. showed an algorithm running in time ${n \choose k/2}^{1 + 4\eta^2 + o(1)}$. Note that this algorithm inherently requires time ${n \choose k/2}$ even when the noise rate $\eta$ is polynomially small. We observe that for sufficiently small noise rate, it is possible to break the ${n \choose k/2}$ barrier. In particular, if for some function $f(n) = \omega(1)$ and $\alpha \in [1/2, 1)$, $k = n/f(n)$ and $\eta = o(f(n)^{-\alpha}/\log n)$, then there is an algorithm for the problem with running time $\mathrm{poly}(n) \cdot {n \choose k}^{1-\alpha} \cdot e^{-k/4.01}$.
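
    The naive end of the mistake/time tradeoff the abstract refers to can be made concrete with the classical halving algorithm: keep every one of the ${n \choose k}$ candidate parities consistent with the answers so far and predict by majority vote. It makes at most about $k \log n$ mistakes but pays ${n \choose k}$ time per round; the paper's contribution is precisely to do better on the time side. A minimal sketch, with illustrative names and not the paper's algorithm:

        import random
        from itertools import combinations

        def parity(support, a):
            """Value of the parity indexed by `support` on the question vector a."""
            return sum(a[i] for i in support) % 2

        class HalvingParityLearner:
            """Halving algorithm for k-parities in the on-line mistake-bound model.

            Keeps every k-subset consistent with the answers seen so far and predicts by
            majority vote, so it makes at most log2(C(n, k)) ~ k log n mistakes, at the
            price of C(n, k) time and space per round -- the naive end of the
            mistake/time tradeoff discussed in the abstract.
            """
            def __init__(self, n, k):
                self.candidates = list(combinations(range(n), k))

            def predict(self, a):
                votes = sum(parity(s, a) for s in self.candidates)
                return 1 if 2 * votes > len(self.candidates) else 0

            def update(self, a, true_answer):
                self.candidates = [s for s in self.candidates if parity(s, a) == true_answer]

        if __name__ == "__main__":
            random.seed(1)
            n, k = 12, 3
            hidden = tuple(sorted(random.sample(range(n), k)))
            learner, mistakes = HalvingParityLearner(n, k), 0
            for _ in range(200):
                a = tuple(random.randint(0, 1) for _ in range(n))
                truth = parity(hidden, a)
                mistakes += int(learner.predict(a) != truth)
                learner.update(a, truth)
            print("mistakes:", mistakes, "candidates left:", len(learner.candidates))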

    QFix: Diagnosing errors through query histories

    Data-driven applications rely on the correctness of their data to function properly and effectively. Errors in data can be incredibly costly and disruptive, leading to loss of revenue, incorrect conclusions, and misguided policy decisions. While data cleaning tools can purge datasets of many errors before the data is used, applications and users interacting with the data can introduce new errors. Subsequent valid updates can obscure these errors and propagate them through the dataset, causing more discrepancies. Even when some of these discrepancies are discovered, they are often corrected superficially, on a case-by-case basis, further obscuring the true underlying cause and making detection of the remaining errors harder. In this paper, we propose QFix, a framework that derives explanations and repairs for discrepancies in relational data by analyzing the effect of queries that operated on the data and identifying potential mistakes in those queries. QFix is flexible, handling scenarios where only a subset of the true discrepancies is known, and robust to different types of update workloads. We make four important contributions: (a) we formalize the problem of diagnosing the causes of data errors based on the queries that operated on and introduced errors to a dataset; (b) we develop exact methods for deriving diagnoses and fixes for identified errors using state-of-the-art tools; (c) we present several optimization techniques that improve our basic approach without compromising accuracy; and (d) we leverage a tradeoff between accuracy and performance to scale diagnosis to large datasets and query logs, while achieving near-optimal results. We demonstrate the effectiveness of QFix through extensive evaluation over benchmark and synthetic data.
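
    To make the diagnosis setting concrete, here is a toy replay of an update log against a small table, flagging which logged queries wrote the value a user complains about. This is only the problem setup: QFix itself searches for the erroneous query and a repaired version of it using exact methods, rather than merely tracking which queries touched a cell. The table, query names, and helper functions below are hypothetical.

        from dataclasses import dataclass
        from typing import Callable

        @dataclass
        class UpdateQuery:
            """A simplified UPDATE: SET column = set_fn(row) WHERE where_fn(row)."""
            name: str
            column: str
            where_fn: Callable
            set_fn: Callable

        def replay_and_blame(initial_rows, log, complaint_id, complaint_col):
            """Replay the query log and report every query that wrote the complained cell."""
            rows = {r["id"]: dict(r) for r in initial_rows}
            suspects = []
            for q in log:
                for r in rows.values():
                    if q.where_fn(r):
                        new_val = q.set_fn(r)
                        if r["id"] == complaint_id and q.column == complaint_col:
                            suspects.append((q.name, r[q.column], new_val))
                        r[q.column] = new_val
            return rows, suspects

        if __name__ == "__main__":
            accounts = [{"id": 1, "balance": 100.0}, {"id": 2, "balance": 50.0}]
            log = [
                UpdateQuery("q1: add 10% interest", "balance",
                            lambda r: r["balance"] >= 50, lambda r: r["balance"] * 1.10),
                UpdateQuery("q2: monthly fee (predicate too broad?)", "balance",
                            lambda r: r["id"] >= 1, lambda r: r["balance"] - 20),
            ]
            final, suspects = replay_and_blame(accounts, log, complaint_id=2, complaint_col="balance")
            print(final)
            print("queries that wrote the complained cell:", suspects)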

    Exact Learning from an Honest Teacher That Answers Membership Queries

    Consider a teacher that holds a function $f:X\to R$ from some class of functions $C$. The teacher can receive from the learner an element $d$ in the domain $X$ (a query) and returns the value of the function at $d$, $f(d)\in R$. The learner's goal is to find $f$ with a minimum number of queries, optimal time complexity, and optimal resources. In this survey, we present some of the results known from the literature, different techniques used, some new problems, and open problems.
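
    As one concrete instance of the query model surveyed here: with membership queries alone, a monotone conjunction (an AND of some hidden subset of the variables) can be learned exactly with $n+1$ queries by probing the all-ones point and each of its single-bit flips. The target class and procedure below are chosen purely for illustration, not taken from the survey.

        def learn_monotone_conjunction(n, membership_query):
            """Exactly learn a monotone conjunction over n Boolean variables with n + 1 membership queries.

            A membership query sends a point x in {0,1}^n to the teacher and receives f(x).
            For f(x) = AND of the variables in a hidden set S, variable i is in S exactly
            when flipping bit i of the all-ones point turns the answer from 1 to 0.
            """
            ones = [1] * n
            assert membership_query(ones) == 1, "a monotone conjunction is 1 on the all-ones point"
            relevant = []
            for i in range(n):
                probe = list(ones)
                probe[i] = 0
                if membership_query(probe) == 0:
                    relevant.append(i)
            return relevant

        if __name__ == "__main__":
            hidden = {1, 3, 4}
            f = lambda x: int(all(x[i] == 1 for i in hidden))
            print(learn_monotone_conjunction(8, f))   # -> [1, 3, 4]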

    Contextual Memory Trees

    We design and study a Contextual Memory Tree (CMT), a learning memory controller that inserts new memories into an experience store of unbounded size. It is designed to efficiently query for memories from that store, supporting logarithmic time insertion and retrieval operations. Hence, CMT can be integrated into existing statistical learning algorithms as an augmented memory unit without substantially increasing training and inference computation. Furthermore, CMT operates as a reduction to classification, allowing it to benefit from advances in representation or architecture. We demonstrate the efficacy of CMT by augmenting existing multi-class and multi-label classification algorithms with CMT and observe statistical improvements. We also test CMT learning on several image-captioning tasks to demonstrate that it performs computationally better than a simple nearest neighbors memory system while benefiting from reward learning. Comment: ICM 201
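
    A rough picture of the data structure: memories live at the leaves of a tree, a query is routed down in logarithmically many steps, and only the reached leaf is scanned. The sketch below uses fixed random-projection routers purely for illustration; CMT's routers are learned through the reduction to classification and refined with reward feedback, which this toy does not attempt.

        import random

        class RoutingMemoryTree:
            """A toy logarithmic-time memory store in the spirit of a Contextual Memory Tree.

            Each internal node routes a vector left or right by comparing a random
            projection against the median projection observed when the node was split;
            leaves hold at most `leaf_size` (key, value) memories.
            """
            def __init__(self, dim, leaf_size=8):
                self.dim, self.leaf_size = dim, leaf_size
                self.root = {"items": []}                       # start as a single leaf

            def _side(self, node, x):
                proj = sum(w * xi for w, xi in zip(node["w"], x))
                return "left" if proj < node["t"] else "right"

            def insert(self, x, value):
                node = self.root
                while "items" not in node:                      # descend to a leaf
                    node = node[self._side(node, x)]
                node["items"].append((x, value))
                if len(node["items"]) > self.leaf_size:         # split an overfull leaf
                    node["w"] = [random.gauss(0, 1) for _ in range(self.dim)]
                    projs = sorted(sum(w * xi for w, xi in zip(node["w"], xx))
                                   for xx, _ in node["items"])
                    node["t"] = projs[len(projs) // 2]          # median projection
                    node["left"], node["right"] = {"items": []}, {"items": []}
                    for xx, vv in node.pop("items"):
                        node[self._side(node, xx)]["items"].append((xx, vv))

            def query(self, x):
                node = self.root
                while "items" not in node:
                    node = node[self._side(node, x)]
                return min(node["items"],                       # local nearest neighbour in the leaf
                           key=lambda kv: sum((a - b) ** 2 for a, b in zip(kv[0], x)),
                           default=None)

        if __name__ == "__main__":
            random.seed(0)
            tree = RoutingMemoryTree(dim=5)
            for i in range(200):
                tree.insert([random.random() for _ in range(5)], "memory-%d" % i)
            print(tree.query([0.5] * 5))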

    A Signaling Game Approach to Databases Querying and Interaction

    As most users do not precisely know the structure and/or the content of databases, their queries do not exactly reflect their information needs. The database management system (DBMS) may interact with users and use their feedback on the returned results to learn the information needs behind their queries. Current query interfaces assume that users do not learn and modify the way they express their information needs in the form of queries during their interaction with the DBMS. Using a real-world interaction workload, we show that users learn and modify how to express their information needs during their interactions with the DBMS, and that their learning is accurately modeled by a well-known reinforcement learning mechanism. As current data interaction systems assume that users do not modify their strategies, they cannot discover the information needs behind users' queries effectively. We model the interaction between users and the DBMS as a game with identical interests between two rational agents whose goal is to establish a common language for representing information needs in the form of queries. We propose a reinforcement learning method that learns and answers the information needs behind queries and adapts to changes in users' strategies, and we prove that, stochastically speaking, it improves the effectiveness of answering queries. We propose two efficient implementations of this method over large relational databases. Our extensive empirical studies over real-world query workloads indicate that our algorithms are efficient and effective. Comment: 21 page
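
    The identical-interest game can be simulated in a few lines: the user reinforces which query to send for each information need, the DBMS reinforces which interpretation to return for each query, and both update from the same reward. The sketch below uses a simple Roth-Erev-style reinforcement rule as a stand-in for the "well-known reinforcement learning mechanism" and says nothing about the paper's algorithms over large relational databases; all sizes and names are made up.

        import random

        def choose(weights):
            """Sample an index with probability proportional to its (positive) weight."""
            r = random.random() * sum(weights)
            acc = 0.0
            for i, w in enumerate(weights):
                acc += w
                if r <= acc:
                    return i
            return len(weights) - 1

        def simulate(n_intents=3, n_queries=3, rounds=3000, seed=0):
            """Toy signaling game between a user and a DBMS with identical interests.

            The user reinforces which query string to use for each information need and
            the DBMS reinforces which interpretation (answer) to return for each query;
            both add the shared reward to the chosen action's weight.
            """
            random.seed(seed)
            user = [[1.0] * n_queries for _ in range(n_intents)]   # intent -> query weights
            dbms = [[1.0] * n_intents for _ in range(n_queries)]   # query  -> intent weights
            hits = 0.0
            for t in range(rounds):
                intent = random.randrange(n_intents)
                query = choose(user[intent])
                answer = choose(dbms[query])
                reward = 1.0 if answer == intent else 0.0
                user[intent][query] += reward
                dbms[query][answer] += reward
                if t >= rounds - 500:
                    hits += reward
            return hits / 500

        if __name__ == "__main__":
            print("accuracy over the last 500 rounds:", simulate())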

    An Imitation Game for Learning Semantic Parsers from User Interaction

    Despite widely successful applications, bootstrapping and fine-tuning semantic parsers remains a tedious process with challenges such as costly data annotation and privacy risks. In this paper, we suggest an alternative, human-in-the-loop methodology for learning semantic parsers directly from users. A semantic parser should be introspective of its uncertainties and prompt for user demonstration when uncertain. In doing so, it also gets to imitate the user's behavior and continue improving itself autonomously, with the hope that eventually it may become as good as the user in interpreting their questions. To combat the sparsity of demonstration, we propose a novel annotation-efficient imitation learning algorithm, which iteratively collects new datasets by mixing demonstrated states and confident predictions and re-trains the semantic parser in a Dataset Aggregation fashion (Ross et al., 2011). We provide a theoretical analysis of its cost bound and also empirically demonstrate its promising performance on the text-to-SQL problem. Code will be available at https://github.com/sunlab-osu/MISP. Comment: Accepted to EMNLP 2020. 20 pages, 6 figure
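
    The training loop described above (prompt for a demonstration only when uncertain, mix demonstrations with confident self-predictions, aggregate, re-train) can be sketched as follows. The parser interface, the confidence threshold, and the memorizing stand-in parser are all hypothetical; the paper's MISP system and its cost-bound analysis are not reproduced here.

        import random

        class MemorizingParser:
            """Trivial stand-in parser: memorizes the (utterance, parse) pairs it was trained on."""
            def __init__(self):
                self.table = {}

            def predict(self, utterance):
                known = utterance in self.table
                return self.table.get(utterance), (1.0 if known else 0.0)

            def fit(self, dataset):
                self.table.update(dict(dataset))

        def imitation_loop(parser, utterances, ask_user, confidence_threshold=0.8, iterations=5):
            """DAgger-style human-in-the-loop training loop sketched from the abstract.

            Each iteration labels a batch with the parser's own confident predictions,
            asks the user to demonstrate only when the parser is uncertain, aggregates
            everything into one growing dataset, and re-trains (Ross et al., 2011).
            """
            dataset = []
            for _ in range(iterations):
                batch = random.sample(utterances, k=min(20, len(utterances)))
                for u in batch:
                    parse, confidence = parser.predict(u)
                    if confidence < confidence_threshold:
                        parse = ask_user(u)      # costly demonstration, requested only when uncertain
                    dataset.append((u, parse))   # dataset aggregation
                parser.fit(dataset)              # re-train on all data collected so far
            return parser

        if __name__ == "__main__":
            random.seed(0)
            utterances = ["count users", "list orders", "max price", "min price"]
            gold = {u: "parse::" + u for u in utterances}    # stands in for the user's demonstrations
            parser = imitation_loop(MemorizingParser(), utterances, ask_user=gold.get)
            print(parser.predict("max price"))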

    Reducing Uncertainty of Schema Matching via Crowdsourcing with Accuracy Rates

    Schema matching is a central challenge for data integration systems. Inspired by the popularity and the success of crowdsourcing platforms, we explore the use of crowdsourcing to reduce the uncertainty of schema matching. Since crowdsourcing platforms are most effective for simple questions, we assume that each Correspondence Correctness Question (CCQ) asks the crowd to decide whether a given correspondence should exist in the correct matching. Furthermore, members of a crowd may sometimes return incorrect answers with different probabilities. The accuracy rate of an individual crowd worker is the probability that the worker returns a correct answer; accuracy rates can serve both as attributes of CCQs and as evaluations of individual workers. We prove that the uncertainty reduction equals the entropy of the answers minus the entropy of the crowd, and we show how to obtain lower and upper bounds for it. We propose frameworks and efficient algorithms to dynamically manage the CCQs so as to maximize the uncertainty reduction within a limited budget of questions. We develop two novel approaches, namely 'Single CCQ' and 'Multiple CCQ', which adaptively select, publish, and manage questions. We verify the value of our solutions with simulations and a real implementation. Comment: 15 page
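
    The quantity being maximized can be written down directly: for a correspondence whose current probability of being correct is $p$ and a worker whose accuracy rate is $q$, the expected uncertainty reduction from one CCQ is the mutual information between correctness and the worker's answer, i.e. the entropy of the answer minus the binary entropy of $q$ (the abstract's "entropy of answers minus entropy of crowds"). Below is a small sketch of that computation together with a greedy, budgeted selection in the spirit of Single CCQ; the paper's frameworks also handle answer incorporation and re-estimation, which are omitted, and all names are illustrative.

        from math import log2

        def h(p):
            """Binary entropy in bits."""
            return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

        def expected_uncertainty_reduction(p_correct, accuracy):
            """Mutual information between a correspondence's correctness (prior p_correct)
            and a single worker's yes/no answer, given the worker's accuracy rate."""
            p_yes = p_correct * accuracy + (1 - p_correct) * (1 - accuracy)
            return h(p_yes) - h(accuracy)       # entropy of answers minus entropy of the crowd

        def pick_ccqs(correspondences, accuracy, budget):
            """Greedily publish the CCQs with the largest expected uncertainty reduction."""
            ranked = sorted(correspondences.items(),
                            key=lambda kv: expected_uncertainty_reduction(kv[1], accuracy),
                            reverse=True)
            return [name for name, _ in ranked[:budget]]

        if __name__ == "__main__":
            priors = {"A.name~B.title": 0.5, "A.id~B.key": 0.95, "A.addr~B.street": 0.7}
            print(pick_ccqs(priors, accuracy=0.8, budget=2))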

    Pay Attention to Those Sets! Learning Quantification from Images

    Major advances have recently been made in merging language and vision representations. But most tasks considered so far have confined themselves to the processing of objects and lexicalised relations amongst objects (content words). We know, however, that humans (even pre-school children) can abstract over raw data to perform certain types of higher-level reasoning, expressed in natural language by function words. A case in point is given by their ability to learn quantifiers, i.e. expressions like 'few', 'some' and 'all'. From formal semantics and cognitive linguistics, we know that quantifiers are relations over sets which, as a simplification, we can see as proportions. For instance, in 'most fish are red', 'most' encodes the proportion of fish which are red. In this paper, we study how well current language and vision strategies model such relations. We show that state-of-the-art attention mechanisms coupled with a traditional linguistic formalisation of quantifiers give the best performance on the task. Additionally, we provide insights on the role of 'gist' representations in quantification. A 'logical' strategy to tackle the task would be to first obtain a numerosity estimation for the two involved sets and then compare their cardinalities. We however argue that precisely identifying the composition of the sets is not only beyond current state-of-the-art models but perhaps even detrimental to a task that is most efficiently performed by refining the approximate numerosity estimator of the system. Comment: Submitted to Journal Paper, 28 pages, 12 figures, 5 table
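
    For concreteness, the 'logical' strategy the authors discuss (and argue against as a modelling target when working from raw images) would, given the two set sizes, reduce quantifier selection to a thresholded proportion. The quantifier inventory and cut-off points below are illustrative only and are not the paper's.

        def quantifier_from_counts(n_target, n_restrictor):
            """Map the proportion n_target / n_restrictor to a quantifier label.

            In 'most fish are red', n_restrictor counts the fish and n_target counts the
            red fish. The inventory and cut-off points are illustrative only.
            """
            if n_restrictor == 0:
                raise ValueError("the restrictor set is empty")
            proportion = n_target / n_restrictor
            if proportion == 0.0:
                return "no"
            if proportion < 0.2:
                return "few"
            if proportion < 0.5:
                return "some"
            if proportion < 1.0:
                return "most"
            return "all"

        if __name__ == "__main__":
            print(quantifier_from_counts(8, 10))   # 8 red fish out of 10 fish -> 'most'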

    Dual Purpose Hashing

    Recent years have seen more and more demand for a unified framework to address multiple realistic image retrieval tasks concerning both category and attributes. Considering the scale of modern datasets, hashing is attractive for its low complexity. However, most existing hashing methods are designed to preserve a single kind of similarity and are thus ill-suited to handling these different tasks simultaneously. To overcome this limitation, we propose a new hashing method, named Dual Purpose Hashing (DPH), which jointly preserves the category and attribute similarities by exploiting Convolutional Neural Network (CNN) models to hierarchically capture the correlations between category and attributes. Since images with both category and attribute labels are scarce, our method is designed to take the abundant partially labelled images on the Internet as training inputs. With such a framework, the binary codes of newly arriving images can be readily obtained by quantizing the network outputs of a binary-like layer, and the attributes can be recovered from the codes easily. Experiments on two large-scale datasets show that our dual purpose hash codes can achieve comparable or even better performance than state-of-the-art methods specifically designed for each individual retrieval task, while being more compact than the compared methods. Comment: With supplementary materials added to the en
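
    The retrieval side described at the end of the abstract is straightforward to sketch: threshold the outputs of a binary-like layer into a hash code and rank the database by Hamming distance. The CNN itself, the attribute-recovery step, and all names and values below are assumed rather than taken from the paper.

        def quantize(activations):
            """Turn the real-valued outputs of a binary-like layer into a hash code
            by thresholding sigmoid-style activations at 0.5 (the CNN is assumed)."""
            return tuple(1 if a >= 0.5 else 0 for a in activations)

        def hamming(a, b):
            """Number of positions where two codes differ."""
            return sum(x != y for x, y in zip(a, b))

        def retrieve(query_code, database, top_k=3):
            """Rank stored images by Hamming distance between hash codes."""
            return sorted(database, key=lambda item: hamming(item["code"], query_code))[:top_k]

        if __name__ == "__main__":
            database = [
                {"name": "img_cat_striped", "code": quantize([0.9, 0.8, 0.1, 0.7])},
                {"name": "img_dog_spotted", "code": quantize([0.2, 0.1, 0.9, 0.6])},
                {"name": "img_cat_plain",   "code": quantize([0.8, 0.9, 0.2, 0.3])},
            ]
            query = quantize([0.95, 0.85, 0.05, 0.60])   # hypothetical CNN outputs for a query image
            for item in retrieve(query, database):
                print(item["name"], item["code"])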