
    A Comparative Study of the Application of Different Learning Techniques to Natural Language Interfaces

    In this paper we present the first results from a comparative study whose aim is to test the feasibility of different inductive learning techniques for the automatic acquisition of linguistic knowledge within a natural language database interface. In our interface architecture, the machine learning module replaces an elaborate semantic analysis component. The learning module learns the correct mapping from a user's input to the corresponding database command based on a collection of past input data. We use an existing interface to a production planning and control system as the evaluation domain and compare the results achieved by different instance-based and model-based learning algorithms.
    Comment: 10 pages, to appear CoNLL9
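To make the idea of an instance-based mapping from user input to database commands concrete, here is a minimal sketch using k-nearest neighbours over a bag-of-words representation. The utterances, command labels, and use of scikit-learn are assumptions for illustration only, not the system evaluated in the paper.

```python
# Minimal sketch: instance-based (k-NN) mapping of user utterances to
# database commands. Illustrative only -- the utterances, command labels,
# and choice of scikit-learn are assumptions, not the paper's setup.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Past input data: (utterance, database command) pairs.
training_utterances = [
    "show all orders from last week",
    "list orders placed yesterday",
    "how many machines are idle",
    "count idle machines right now",
]
training_commands = [
    "SELECT_ORDERS_BY_DATE",
    "SELECT_ORDERS_BY_DATE",
    "COUNT_IDLE_MACHINES",
    "COUNT_IDLE_MACHINES",
]

# Bag-of-words features + 1-nearest-neighbour classifier.
model = make_pipeline(CountVectorizer(), KNeighborsClassifier(n_neighbors=1))
model.fit(training_utterances, training_commands)

# A new user input is mapped to the command of its nearest stored instance.
print(model.predict(["which machines are idle"])[0])  # -> COUNT_IDLE_MACHINES
```

A model-based learner (e.g. a decision tree or naive Bayes classifier) could be swapped in at the same point in the pipeline, which is the comparison the study sets up.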

    CleanML: A Study for Evaluating the Impact of Data Cleaning on ML Classification Tasks

    Data quality affects machine learning (ML) model performance, and data scientists spend a considerable amount of time on data cleaning before model training. However, to date, there does not exist a rigorous study on how exactly cleaning affects ML -- the ML community usually focuses on developing ML algorithms that are robust to particular noise types of certain distributions, while the database (DB) community has mostly studied the problem of data cleaning alone without considering how the data is consumed by downstream ML analytics. We propose the CleanML study, which systematically investigates the impact of data cleaning on ML classification tasks. The open-source and extensible CleanML study currently includes 14 real-world datasets with real errors, five common error types, seven different ML models, and multiple cleaning algorithms for each error type (including both algorithms commonly used in practice and state-of-the-art solutions from the academic literature). We control the randomness in ML experiments using statistical hypothesis testing, and we control the false discovery rate using the Benjamini-Yekutieli (BY) procedure. We analyze the results systematically to derive many interesting and nontrivial observations. We also put forward multiple directions for future research.
    Comment: published in ICDE 202
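To illustrate the false-discovery-rate control mentioned in the abstract, the Benjamini-Yekutieli procedure is available in statsmodels. The p-values below are made-up placeholders and the snippet is a sketch of the correction step only, not CleanML's actual experimental pipeline.

```python
# Sketch of false discovery rate control with the Benjamini-Yekutieli (BY)
# procedure, used to correct for many simultaneous hypothesis tests.
# The p-values are made-up placeholders, not CleanML results.
from statsmodels.stats.multitest import multipletests

# One p-value per (dataset, error type, model, cleaning method) comparison.
p_values = [0.001, 0.008, 0.020, 0.049, 0.120, 0.350, 0.700]

reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="fdr_by")

for p, p_adj, keep in zip(p_values, p_adjusted, reject):
    status = "significant" if keep else "not significant"
    print(f"raw p={p:.3f}  BY-adjusted p={p_adj:.3f}  {status}")
```

The BY variant makes no assumption about independence between the tests, which is why it is a conservative but safe choice when many overlapping comparisons are drawn from the same datasets.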

    Near-Optimal Induced Universal Graphs for Bounded Degree Graphs

    A graph $U$ is an induced universal graph for a family $F$ of graphs if every graph in $F$ is a vertex-induced subgraph of $U$. For the family of all undirected graphs on $n$ vertices, Alstrup, Kaplan, Thorup, and Zwick [STOC 2015] give an induced universal graph with $O\!\left(2^{n/2}\right)$ vertices, matching a lower bound by Moon [Proc. Glasgow Math. Assoc. 1965]. Let $k = \lceil D/2 \rceil$. Improving asymptotically on previous results by Butler [Graphs and Combinatorics 2009] and Esperet, Arnaud and Ochem [IPL 2008], we give an induced universal graph with $O\!\left(\frac{k2^k}{k!}n^k\right)$ vertices for the family of graphs with $n$ vertices of maximum degree $D$. For constant $D$, Butler gives a lower bound of $\Omega\!\left(n^{D/2}\right)$. For an odd constant $D \geq 3$, Esperet et al. and Alon and Capalbo [SODA 2008] give a graph with $O\!\left(n^{k-\frac{1}{D}}\right)$ vertices. Using their techniques for any (including constant) even values of $D$ gives asymptotically worse bounds than we present. For large $D$, i.e. when $D = \Omega\left(\log^3 n\right)$, the previous best upper bound was ${n\choose\lceil D/2\rceil} n^{O(1)}$ due to Adjiashvili and Rotbart [ICALP 2014]. We give upper and lower bounds showing that the size is ${\lfloor n/2\rfloor\choose\lfloor D/2 \rfloor}2^{\pm\tilde{O}\left(\sqrt{D}\right)}$. Hence the optimal size is $2^{\tilde{O}(D)}$ and our construction is within a factor of $2^{\tilde{O}\left(\sqrt{D}\right)}$ from this. The previous results were larger by at least a factor of $2^{\Omega(D)}$. As a part of the above, proving a conjecture by Esperet et al., we construct an induced universal graph with $2n-1$ vertices for the family of graphs with max degree $2$. In addition, we give results for acyclic graphs with max degree $2$ and cycle graphs. Our results imply the first labeling schemes that for any $D$ are at most $o(n)$ bits from optimal.
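As a concrete reading of the definition in the first sentence, the sketch below checks that every graph in a small family appears as a vertex-induced subgraph of a candidate $U$. The toy instance is an assumption chosen for illustration (not the paper's construction): a path on $2n-1$ vertices, checked against the family of disjoint unions of paths on $n = 3$ vertices, i.e. the acyclic max-degree-2 case mentioned at the end of the abstract. The induced-subgraph test uses networkx's VF2 matcher.

```python
# Sketch of the definition: U is induced-universal for a family F if every
# graph in F appears as a vertex-induced subgraph of U.  As a toy instance
# (an illustrative assumption, not the paper's construction), we check that
# the path on 2n-1 vertices works for the family of disjoint unions of
# paths on n = 3 vertices (the acyclic, max-degree-2 case).
import networkx as nx
from networkx.algorithms.isomorphism import GraphMatcher

n = 3
U = nx.path_graph(2 * n - 1)  # candidate universal graph on 2n-1 = 5 vertices

# One representative per isomorphism class of path-forests on 3 vertices.
edge_plus_isolated = nx.Graph([(0, 1)])
edge_plus_isolated.add_node(2)  # the isolated third vertex
family = {
    "3 isolated vertices": nx.empty_graph(3),
    "edge + isolated vertex": edge_plus_isolated,
    "path on 3 vertices": nx.path_graph(3),
}

for name, G in family.items():
    # GraphMatcher.subgraph_is_isomorphic() tests for a *node-induced* subgraph of U.
    found = GraphMatcher(U, G).subgraph_is_isomorphic()
    print(f"{name}: induced subgraph of U? {found}")
```

Each check prints True here because the component paths can be laid out along U with one skipped vertex between them, which uses at most $n + (n-1) = 2n-1$ vertices in total.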