62,288 research outputs found

    Wrapper Maintenance: A Machine Learning Approach

    Full text link
    The proliferation of online information sources has led to an increased use of wrappers for extracting data from Web sources. While most of the previous research has focused on quick and efficient generation of wrappers, the development of tools for wrapper maintenance has received less attention. This is an important research problem because Web sources often change in ways that prevent the wrappers from extracting data correctly. We present an efficient algorithm that learns structural information about data from positive examples alone. We describe how this information can be used for two wrapper maintenance applications: wrapper verification and reinduction. The wrapper verification system detects when a wrapper is not extracting correct data, usually because the Web source has changed its format. The reinduction algorithm automatically recovers from changes in the Web source by identifying data on Web pages so that a new wrapper may be generated for this source. To validate our approach, we monitored 27 wrappers over a period of a year. The verification algorithm correctly discovered 35 of the 37 wrapper changes, and made 16 mistakes, resulting in precision of 0.73 and recall of 0.95. We validated the reinduction algorithm on ten Web sources. We were able to successfully reinduce the wrappers, obtaining precision and recall values of 0.90 and 0.80 on the data extraction task

    A Syntactic Neural Model for General-Purpose Code Generation

    Full text link
    We consider the problem of parsing natural language descriptions into source code written in a general-purpose programming language like Python. Existing data-driven methods treat this problem as a language generation task without considering the underlying syntax of the target programming language. Informed by previous work in semantic parsing, in this paper we propose a novel neural architecture powered by a grammar model to explicitly capture the target syntax as prior knowledge. Experiments find this an effective way to scale up to generation of complex programs from natural language descriptions, achieving state-of-the-art results that well outperform previous code generation and semantic parsing approaches.Comment: To appear in ACL 201

    Phrase structure grammars as indicative of uniquely human thoughts

    Get PDF
    I argue that the ability to compute phrase structure grammars is indicative of a particular kind of thought. This type of thought that is only available to cognitive systems that have access to the computations that allow the generation and interpretation of the structural descriptions of phrase structure grammars. The study of phrase structure grammars, and formal language theory in general, is thus indispensable to studies of human cognition, for it makes explicit both the unique type of human thought and the underlying mechanisms in virtue of which this thought is made possible

    Linguistic and metalinguistic categories in second language learning

    Get PDF
    This paper discusses proposed characteristics of implicit linguistic and explicit metalinguistic knowledge representations as well as the properties of implicit and explicit processes believed to operate on these representations. In accordance with assumptions made in the usage-based approach to language and language acquisition, it is assumed that implicit linguistic knowledge is represented in terms of flexible and context-dependent categories which are subject to similarity-based processing. It is suggested that, by contrast, explicit metalinguistic knowledge is characterized by stable and discrete Aristotelian categories which subserve conscious, rule-based processing. The consequences of these differences in category structure and processing mechanisms for the usefulness or otherwise of metalinguistic knowledge in second language learning and performance are explored. Reference is made to existing empirical and theoretical research about the role of metalinguistic knowledge in second language acquisition, and specific empirical predictions arising out of the line of argument adopted in the current paper are put forward. © Walter de Gruyter 2008

    Robust Processing of Natural Language

    Full text link
    Previous approaches to robustness in natural language processing usually treat deviant input by relaxing grammatical constraints whenever a successful analysis cannot be provided by ``normal'' means. This schema implies, that error detection always comes prior to error handling, a behaviour which hardly can compete with its human model, where many erroneous situations are treated without even noticing them. The paper analyses the necessary preconditions for achieving a higher degree of robustness in natural language processing and suggests a quite different approach based on a procedure for structural disambiguation. It not only offers the possibility to cope with robustness issues in a more natural way but eventually might be suited to accommodate quite different aspects of robust behaviour within a single framework.Comment: 16 pages, LaTeX, uses pstricks.sty, pstricks.tex, pstricks.pro, pst-node.sty, pst-node.tex, pst-node.pro. To appear in: Proc. KI-95, 19th German Conference on Artificial Intelligence, Bielefeld (Germany), Lecture Notes in Computer Science, Springer 199

    Substructure Discovery Using Minimum Description Length and Background Knowledge

    Full text link
    The ability to identify interesting and repetitive substructures is an essential component to discovering knowledge in structural data. We describe a new version of our SUBDUE substructure discovery system based on the minimum description length principle. The SUBDUE system discovers substructures that compress the original data and represent structural concepts in the data. By replacing previously-discovered substructures in the data, multiple passes of SUBDUE produce a hierarchical description of the structural regularities in the data. SUBDUE uses a computationally-bounded inexact graph match that identifies similar, but not identical, instances of a substructure and finds an approximate measure of closeness of two substructures when under computational constraints. In addition to the minimum description length principle, other background knowledge can be used by SUBDUE to guide the search towards more appropriate substructures. Experiments in a variety of domains demonstrate SUBDUE's ability to find substructures capable of compressing the original data and to discover structural concepts important to the domain. Description of Online Appendix: This is a compressed tar file containing the SUBDUE discovery system, written in C. The program accepts as input databases represented in graph form, and will output discovered substructures with their corresponding value.Comment: See http://www.jair.org/ for an online appendix and other files accompanying this articl
    corecore