969 research outputs found

    Large-Scale Pattern-Based Information Extraction from the World Wide Web

    Extracting information from text is the task of obtaining structured, machine-processable facts from information that is expressed in an unstructured manner. It thus allows systems to automatically aggregate information for further analysis, efficient retrieval, automatic validation, or appropriate visualization. This work explores the potential of using textual patterns for Information Extraction from the World Wide Web.
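
    As a minimal sketch of what pattern-based extraction looks like in practice, the snippet below applies two Hearst-style "is-a" templates to raw text. The patterns and the example sentence are illustrative only and are not taken from the thesis.

```python
import re

# Two illustrative Hearst-style templates mapping surface text to
# (instance, "is-a", class) facts. Real systems use far richer pattern
# inventories, often learned automatically from Web-scale corpora.
PATTERNS = [
    re.compile(r"(?P<cls>\w+) such as (?P<inst>[A-Z]\w*)"),    # "cities such as Paris"
    re.compile(r"(?P<inst>[A-Z]\w*) and other (?P<cls>\w+)"),  # "Paris and other cities"
]

def extract_facts(text):
    """Return (instance, 'is-a', class) triples matched by any pattern."""
    facts = []
    for pattern in PATTERNS:
        for m in pattern.finditer(text):
            facts.append((m.group("inst"), "is-a", m.group("cls")))
    return facts

print(extract_facts("European cities such as Paris attract many visitors."))
# [('Paris', 'is-a', 'cities')]
```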

    Slot Filling

    Slot filling (SF) is the task of automatically extracting facts about particular entities from unstructured text, and populating a knowledge base (KB) with these facts. These structured KBs enable applications such as structured web queries and question answering. SF is typically framed as a query-oriented setting of the related task of relation extraction. Throughout this thesis, we reflect on how SF is a task with many distinct problems. We demonstrate that recall is a major limiting factor in SF system performance. We contribute an analysis of typical SF recall loss, and find that a substantial amount of loss occurs early in the SF pipeline. We confirm that accurate NER and coreference resolution are required for high-recall SF. We measure upper bounds using a naïve graph-based semi-supervised bootstrapping technique, and find that only 39% of results are reachable using a typical feature space. We expect that this graph-based technique will be directly useful for extraction, and this leads us to frame SF as a label propagation task. We focus on a detailed graph representation of the task which reflects the behaviour and assumptions we want to model based on our analysis, including modifying the label propagation process to model multiple types of label interaction. Analysing the graph, we find that a large number of errors occur in very close proximity to training data, and identify that this is of major concern for propagation. While there are some conflicts caused by a lack of sufficient disambiguating context—we explore adding additional contextual features to address this—many of these conflicts are caused by subtle annotation problems. We find that the lack of a standard for how explicitly relations must be expressed in text makes consistent annotation difficult. Using a strict definition of explicitness results in 20% of correct annotations being removed from a standard dataset. We contribute several annotation-driven analyses of this problem, exploring the definition of slots and the effect of the lack of a concrete definition of explicitness: annotation schemas do not specify how explicit expressions of relations must be, and there is large scope for disagreement between annotators. Additionally, applications may require relatively strict or relaxed evidence for extractions, but this is not considered in annotation tasks. We demonstrate that annotators frequently disagree on instances, depending on differences in annotators' world knowledge and their thresholds for making probabilistic inferences. SF is fundamental to enabling many knowledge-based applications, and this work motivates modelling and evaluating SF to better target these tasks.
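
    As a rough illustration of framing extraction as label propagation, here is a minimal sketch over a precomputed candidate-similarity graph. The graph construction, feature space, and the thesis's modelling of multiple label-interaction types are not reproduced; the matrix and seed assignments are assumptions for illustration.

```python
import numpy as np

def propagate(W, seed_labels, iterations=50):
    """Basic label propagation with clamped seeds.
    W: (n, n) nonnegative similarity matrix over candidate nodes.
    seed_labels: dict node_index -> class_index for labelled seeds.
    Returns an (n, k) matrix of per-class scores."""
    n = W.shape[0]
    k = max(seed_labels.values()) + 1
    Y = np.zeros((n, k))
    for node, label in seed_labels.items():
        Y[node, label] = 1.0
    # Row-normalize W so each node averages its neighbours' labels;
    # the epsilon guards against isolated (all-zero) rows.
    P = W / np.maximum(W.sum(axis=1, keepdims=True), 1e-12)
    for _ in range(iterations):
        Y = P @ Y
        for node, label in seed_labels.items():  # clamp the seeds
            Y[node, :] = 0.0
            Y[node, label] = 1.0
    return Y

# A four-node chain with seeds at either end: the middle nodes take the
# label of the nearer seed.
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
print(propagate(W, {0: 0, 3: 1}).argmax(axis=1))  # [0 0 1 1]
```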

    Domain-Specific Knowledge Exploration with Ontology Hierarchical Re-Ranking and Adaptive Learning and Extension

    The goal of this research project is the realization of an artificial-intelligence-driven, lightweight domain knowledge search framework that returns a domain knowledge structure upon request, with highly relevant web resources, via a set of domain-centric re-ranking algorithms and adaptive ontology learning models. The re-ranking algorithm, a necessary mechanism to counteract the heterogeneity and unstructured nature of web data, uses augmented queries and a hierarchical taxonomic structure to gain further insight into the initial search results obtained from credited generic search engines. A semantic weight scale is applied to each node in the ontology graph, which in turn generates a matrix of aggregated link relation scores used to compute the likely semantic correspondence between nodes and documents. Bootstrapped with a lightweight seed domain ontology, the theoretical platform focuses on the core back-end building blocks, employing two supervised automated learning models as well as semi-automated verification processes to progressively enhance, prune, and inspect the domain ontology, formulating a growing, up-to-date, and trustworthy system. The framework provides an in-depth knowledge search platform and enhances the user's knowledge-acquisition experience. With a minimal footprint, the system stores only the metadata needed for likely domain knowledge searches, providing fast fetching and caching. In addition, the re-ranking and ontology learning processes can be operated offline or in a preprocessing stage, so the system carries no significant overhead at runtime.
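
    A toy sketch of the node-document scoring idea follows: each ontology node carries a semantic weight, and a document's score for a node aggregates its term overlap with the node and, with a decay, with the node's descendants in the taxonomy. All data structures, the decay constant, and the overlap measure are assumptions for illustration, not the paper's actual algorithm.

```python
import numpy as np

def node_doc_scores(node_terms, node_weights, children, docs):
    """node_terms: list of term sets, one per ontology node.
    node_weights: per-node semantic weights.
    children: list of child-index lists (the taxonomic structure).
    docs: list of token sets. Returns an (n_nodes, n_docs) score matrix."""
    n, m = len(node_terms), len(docs)
    S = np.zeros((n, m))

    def score(node, doc_tokens):
        direct = len(node_terms[node] & doc_tokens)
        inherited = sum(score(c, doc_tokens) for c in children[node])
        return node_weights[node] * direct + 0.5 * inherited  # 0.5: assumed decay

    for i in range(n):
        for j, d in enumerate(docs):
            S[i, j] = score(i, d)
    return S

# A three-node taxonomy: "science" with children "biology" and "physics".
terms = [{"science"}, {"biology", "gene"}, {"physics", "quantum"}]
children = [[1, 2], [], []]
docs = [{"gene", "expression"}, {"quantum", "theory"}]
print(node_doc_scores(terms, [1.0, 1.0, 1.0], children, docs))
```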

    Ontologies and Information Extraction

    This report argues that, even in the simplest cases, IE is an ontology-driven process. It is not a mere text-filtering method based on simple pattern matching and keywords, because the extracted pieces of text are interpreted with respect to a predefined partial domain model. The report shows that, depending on the nature and depth of the interpretation required to extract the information, more or less knowledge must be involved. The discussion is illustrated mainly with examples from biology, a domain with critical needs for content-based exploration of the scientific literature and one that has become a major application domain for IE.
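
    The following toy example (not taken from the report) illustrates the central point: a candidate triple produced by surface matching is only accepted as a fact if its argument types satisfy the constraints of a partial domain model. The relation, entities, and type assignments are hypothetical.

```python
# A partial domain model for a biology-flavoured example: each relation
# constrains the types of its two arguments.
DOMAIN_MODEL = {
    "expresses": ("Gene", "Protein"),
}
ENTITY_TYPES = {"BRCA1": "Gene", "p53": "Protein", "yeast": "Organism"}

def interpret(relation, arg1, arg2):
    """Accept an extracted triple only if it fits the domain model."""
    expected = DOMAIN_MODEL.get(relation)
    if expected is None:
        return None  # no ontological interpretation available
    if (ENTITY_TYPES.get(arg1), ENTITY_TYPES.get(arg2)) == expected:
        return (arg1, relation, arg2)
    return None

print(interpret("expresses", "BRCA1", "p53"))  # accepted: types fit the model
print(interpret("expresses", "yeast", "p53"))  # None: violates the model
```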

    Geppetto: Versatile Verifiable Computation

    Cloud computing sparked interest in Verifiable Computation protocols, which allow a weak client to securely outsource computations to remote parties. Recent work has dramatically reduced the client's cost to verify the correctness of results, but the overhead of producing proofs remains largely impractical. Geppetto introduces complementary techniques for reducing prover overhead and increasing prover flexibility. With Multi-QAPs, Geppetto reduces the cost of sharing state between computations (e.g., for MapReduce) or within a single computation by up to two orders of magnitude. Via a careful instantiation of cryptographic primitives, Geppetto also brings down the cost of verifying outsourced cryptographic computations (e.g., verifiably computing on signed data); together with Geppetto's notion of bounded proof bootstrapping, Geppetto improves on prior bootstrapped systems by five orders of magnitude, albeit at some cost in universality. Geppetto also supports qualitatively new properties like verifying the correct execution of proprietary (i.e., secret) algorithms. Finally, Geppetto's use of energy-saving circuits brings the prover's costs more in line with the program's actual (rather than worst-case) execution time. Geppetto is implemented in a full-fledged, scalable compiler that consumes LLVM code generated from a variety of apps, as well as a large cryptographic library.
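
    To make the workflow concrete, here is a schematic of the generic KeyGen/Prove/Verify interface that verifiable-computation systems like Geppetto implement. The placeholder "proof" below is a plain hash with no cryptographic soundness; it merely stands in for the QAP-based proofs the paper actually constructs, and all function names are assumptions for illustration.

```python
import hashlib

def keygen(program_src):
    # A real VC system compiles the program (e.g., from LLVM output) into
    # an evaluation key and a verification key; here both are just a digest.
    key = hashlib.sha256(program_src.encode()).hexdigest()
    return key, key  # (evaluation key, verification key)

def prove(ek, program, x):
    y = program(x)  # the untrusted worker runs the computation
    tag = hashlib.sha256(f"{ek}|{x}|{y}".encode()).hexdigest()
    return y, tag   # result plus placeholder "proof"

def verify(vk, x, y, proof):
    # In a real system verification is cheap and does NOT rerun the
    # program; this placeholder check has neither property and is
    # trivially forgeable. It only shows the shape of the interface.
    return proof == hashlib.sha256(f"{vk}|{x}|{y}".encode()).hexdigest()

square = lambda x: x * x
ek, vk = keygen("def square(x): return x * x")
y, proof = prove(ek, square, 7)
assert verify(vk, 7, y, proof) and y == 49
```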

    Adapting Searchy to extract data using evolved wrappers

    This is the author's version of a work that was accepted for publication in Expert Systems with Applications: An International Journal. Changes resulting from the publishing process, such as peer review, editing, corrections, structural formatting, and other quality control mechanisms, may not be reflected in this document. Changes may have been made to this work since it was submitted for publication. A definitive version was subsequently published in Expert Systems with Applications: An International Journal, 39, 3 (2012), DOI: 10.1016/j.eswa.2011.08.168.
    Organizations need diverse information systems to deal with increasing requirements in information storage and processing, which yields the creation of information islands and hence an intrinsic difficulty in obtaining a global view. Providing such a unified view of the (likely heterogeneous) information available in an organization adds value to the information systems and has been the subject of intense research. In this paper we present an extension of a solution named Searchy, an agent-based mediator system specialized in data extraction and integration. Through the use of a set of wrappers, it integrates information from arbitrary sources and semantically translates it according to a mediated scheme. Searchy is in fact a domain-independent wrapper container that eases wrapper development, providing, for example, semantic mapping. The extension of Searchy proposed in this paper introduces an evolutionary wrapper that is able to evolve wrappers using regular expressions. To achieve this, a Genetic Algorithm (GA) is used to learn a regex that extracts a set of positive samples while rejecting a set of negative samples.
    The authors gratefully acknowledge Martín Knoblauch for his useful suggestions and valuable comments. This work has been partially supported by the Spanish Ministry of Science and Innovation under the projects ABANT (TIN2010-19872) and COMPUBIODIVE (TIN2007-65989), and by Castilla-La Mancha project PEII09-0266-6640.
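
    A heavily simplified sketch of the GA idea follows: individuals are raw regex strings, fitness rewards matching positive samples and penalizes matching negatives, and syntactically invalid regexes are eliminated by an unbeatable penalty. The alphabet, genetic operators, and parameters are assumptions for illustration; the paper's encoding and operators are more sophisticated.

```python
import random
import re

# Characters an evolved regex may contain (an assumed, deliberately small
# alphabet); the backslash pairs contribute '\', 'w', '\', 'd' as symbols.
ALPHABET = list("abcdefghijklmnopqrstuvwxyz0123456789.@\\w\\d+*")

def random_regex(length=8):
    return "".join(random.choice(ALPHABET) for _ in range(length))

def fitness(regex, positives, negatives):
    try:
        pat = re.compile(regex)
    except re.error:
        return float("-inf")  # invalid individuals die out
    hits = sum(bool(pat.search(s)) for s in positives)
    false = sum(bool(pat.search(s)) for s in negatives)
    return hits - false

def mutate(regex):
    i = random.randrange(len(regex))
    return regex[:i] + random.choice(ALPHABET) + regex[i + 1:]

def crossover(a, b):
    i = random.randrange(1, min(len(a), len(b)))
    return a[:i] + b[i:]

def evolve(positives, negatives, pop_size=100, generations=200):
    pop = [random_regex() for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda r: fitness(r, positives, negatives), reverse=True)
        elite = pop[: pop_size // 5]  # keep the top 20%
        pop = elite + [mutate(crossover(random.choice(elite),
                                        random.choice(elite)))
                       for _ in range(pop_size - len(elite))]
    pop.sort(key=lambda r: fitness(r, positives, negatives), reverse=True)
    return pop[0]
```

    For instance, evolving against positives that are email-like strings and negatives that are plain words tends, under this fitness, to favour individuals containing "@"; real wrapper induction needs richer operators and larger samples than this sketch uses.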