
    Pattern Matching and Discourse Processing in Information Extraction from Japanese Text

    Information extraction is the task of automatically identifying information of interest in unconstrained text. Information of interest is usually extracted in two steps. First, sentence-level processing locates relevant pieces of information scattered throughout the text; second, discourse processing merges coreferential information to generate the output. In the first step, pieces of information are identified locally, without recognizing any relationships among them; a keyword search or simple pattern search can achieve this purpose. The second step requires deeper knowledge in order to understand the relationships among the separately identified pieces of information. Previous information extraction systems focused on the first step, partly because they were not required to link each piece of information with the others. To link the extracted pieces of information and map them onto a structured output format, complex discourse processing is essential. This paper reports on a Japanese information extraction system that merges information using a pattern matcher and a discourse processor. Evaluation results show a high level of system performance that approaches human performance. (See http://www.jair.org/ for any accompanying file.)
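
    The two-step division maps naturally onto code. Below is a minimal sketch, not the paper's system: the appointment pattern and slot names are invented, and sharing a surface name stands in for real coreference resolution.

        import re
        from collections import defaultdict

        # Step 1 (sentence level): locate pieces of information with a simple
        # pattern. Pattern and slot names are illustrative only.
        PATTERN = re.compile(r"(?P<person>[A-Z][a-z]+) was appointed "
                             r"(?P<post>[a-z ]+) of (?P<org>[A-Z][A-Za-z ]+)")

        def extract_local(sentences):
            for s in sentences:
                m = PATTERN.search(s)
                if m:
                    yield m.groupdict()

        # Step 2 (discourse level): merge coreferential pieces into one record;
        # later mentions update earlier ones.
        def merge(records):
            merged = defaultdict(dict)
            for r in records:
                merged[r["person"]].update(r)
            return list(merged.values())

        sentences = ["Tanaka was appointed president of Sony.",
                     "Tanaka was appointed director of Sony Europe."]
        print(merge(extract_local(sentences)))
        # [{'person': 'Tanaka', 'post': 'director', 'org': 'Sony Europe'}]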

    Extracting information from short messages

    Much currently transmitted information takes the form of e-mails or SMS text messages, so extracting information from such short messages is increasingly important. The words in a message can be partitioned into the syntactic structure, terms from the domain of discourse, and the data being transmitted. This paper describes a lightweight Information Extraction component which uses pattern matching to separate the three aspects: the structure is supplied as a template; domain terms are the metadata of a data source (or their synonyms); and data is extracted as the words matching placeholders in the templates.
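
    The template idea is easy to make concrete. A minimal sketch, assuming an invented <...> placeholder syntax and an invented flight-status template; real templates and domain terms would come from the data source's metadata:

        import re

        TEMPLATE = "flight <flight_no> delayed until <time>"

        def template_to_regex(template):
            parts = []
            for token in template.split():
                if token.startswith("<") and token.endswith(">"):
                    parts.append(r"(?P<%s>\S+)" % token[1:-1])  # data placeholder
                else:
                    parts.append(re.escape(token))              # structure/domain term
            return re.compile(r"\s+".join(parts), re.IGNORECASE)

        msg = "Flight BA123 delayed until 18:40"
        match = template_to_regex(TEMPLATE).search(msg)
        print(match.groupdict() if match else None)
        # {'flight_no': 'BA123', 'time': '18:40'}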

    Wrapper Maintenance: A Machine Learning Approach

    The proliferation of online information sources has led to an increased use of wrappers for extracting data from Web sources. While most of the previous research has focused on quick and efficient generation of wrappers, the development of tools for wrapper maintenance has received less attention. This is an important research problem because Web sources often change in ways that prevent the wrappers from extracting data correctly. We present an efficient algorithm that learns structural information about data from positive examples alone. We describe how this information can be used for two wrapper maintenance applications: wrapper verification and reinduction. The wrapper verification system detects when a wrapper is not extracting correct data, usually because the Web source has changed its format. The reinduction algorithm automatically recovers from changes in the Web source by identifying data on Web pages so that a new wrapper may be generated for this source. To validate our approach, we monitored 27 wrappers over a period of a year. The verification algorithm correctly discovered 35 of the 37 wrapper changes, and made 16 mistakes, resulting in precision of 0.73 and recall of 0.95. We validated the reinduction algorithm on ten Web sources. We were able to successfully reinduce the wrappers, obtaining precision and recall values of 0.90 and 0.80 on the data extraction task.
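
    The verification step can be sketched as follows: learn coarse structural features of known-good extractions (positive examples only), then flag output that drifts from them. The two features and the tolerance below are our illustration, not the paper's learned patterns:

        # Profile correct extractions, then test whether new output still fits.
        def profile(values):
            return {
                "mean_len": sum(len(v) for v in values) / len(values),
                "starts_digit": sum(v[:1].isdigit() for v in values) / len(values),
            }

        def looks_broken(trained, new_values, tol=0.5):
            p = profile(new_values)
            return (abs(p["mean_len"] - trained["mean_len"]) > tol * trained["mean_len"]
                    or abs(p["starts_digit"] - trained["starts_digit"]) > tol)

        trained = profile(["$12.99", "$8.50", "$103.00"])   # old, correct prices
        print(looks_broken(trained, ["$9.99", "$15.25"]))   # False: still fits
        print(looks_broken(trained, ["<td class='price'>$12.99</td>"]))
        # True: the page format changed and raw markup leaked through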

    Generating indicative-informative summaries with SumUM

    We present and evaluate SumUM, a text summarization system that takes a raw technical text as input and produces an indicative-informative summary. The indicative part of the summary identifies the topics of the document, and the informative part elaborates on some of these topics according to the reader's interest. SumUM motivates the topics, describes entities, and defines concepts. It is a first step toward exploring the issue of dynamic summarization. This is accomplished through a process of shallow syntactic and semantic analysis, concept identification, and text regeneration. Our method was developed through the study of a corpus of abstracts written by professional abstractors. Relying on human judgment, we have evaluated the indicativeness, informativeness, and text acceptability of the automatic summaries. The results thus far indicate good performance when compared with other summarization technologies.
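
    A toy rendering of the indicative/informative split (frequency-based topic spotting below stands in for SumUM's shallow syntactic and semantic analysis, which we do not reproduce):

        import re
        from collections import Counter

        def topics(text, k=3):                    # indicative: what is discussed
            words = re.findall(r"[a-z]{5,}", text.lower())
            return [w for w, _ in Counter(words).most_common(k)]

        def indicative(text):
            return "This document discusses: " + ", ".join(topics(text)) + "."

        def informative(text, chosen):            # informative: elaborate on request
            sents = re.split(r"(?<=[.!?])\s+", text)
            return " ".join(s for s in sents if chosen in s.lower())

        doc = ("Wrapper induction learns extraction rules. Wrapper verification "
               "detects format changes. Verification compares old and new output.")
        print(indicative(doc))
        print(informative(doc, "verification"))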

    Ensuring Readability and Data-fidelity using Head-modifier Templates in Deep Type Description Generation

    A type description is a succinct noun compound that helps humans and machines quickly grasp the informative and distinctive facts about an entity. Entities in most knowledge graphs (KGs) still lack such descriptions, calling for automatic methods to supplement this information. However, existing generative methods either overlook the grammatical structure or make factual mistakes in the generated text. To solve these problems, we propose a head-modifier template-based method to ensure the readability and data fidelity of generated type descriptions. We also propose a new dataset and two automatic metrics for this task. Experiments show that our method improves substantially over baselines and achieves state-of-the-art performance on both datasets.
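
    The core of the head-modifier idea fits in a few lines. In this sketch (our illustration; the slot values are invented) the head noun is copied from the KG so the category cannot drift, modifiers are added for distinctiveness, and the template guarantees a grammatical noun compound:

        def type_description(head, modifiers):
            # template: [modifier]* head, e.g. "American rock band"
            return " ".join(modifiers + [head])

        facts = {"category": "band", "genre": "rock", "country": "American"}
        head = facts["category"]                   # copied verbatim: data fidelity
        mods = [facts["country"], facts["genre"]]  # chosen for distinctiveness
        print(type_description(head, mods))        # "American rock band"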

    Information extraction

    In this paper we present a new approach that uses knowledge graphs to extract relevant information from natural language text. We give a multiple-level model based on knowledge graphs for describing template information, and investigate the concept of partial structural parsing. Moreover, we point out that expansion of concepts plays an important role in thinking, so we study the expansion of knowledge graphs to use context information for reasoning and merging of templates.
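
    A minimal sketch of the merging idea under our own simplifications (the graph structure and one-step expansion are illustrative, not the paper's formalism): extracted templates become graph nodes, and two templates merge when concept expansion links them through a shared node.

        from collections import defaultdict

        class KG:
            def __init__(self):
                self.edges = defaultdict(set)
            def add(self, subj, rel, obj):
                self.edges[subj].add((rel, obj))
            def expand(self, concept):
                # one-step concept expansion: the node plus its neighbours
                return {concept} | {o for _, o in self.edges[concept]}

        g = KG()
        g.add("template1", "mentions", "Smith")
        g.add("template2", "mentions", "Smith")

        def should_merge(g, t1, t2):
            return bool((g.expand(t1) - {t1}) & (g.expand(t2) - {t2}))

        print(should_merge(g, "template1", "template2"))  # True: shared concept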

    Mixing representation levels: The hybrid approach to automatic text generation

    Natural language generation (NLG) systems map non-linguistic representations into strings of words through a number of steps, using intermediate representations at various levels of abstraction. Template-based systems, by contrast, tend to use only one representation level, i.e. fixed strings, which are combined, possibly in a sophisticated way, to generate the final text. In some circumstances, it may be profitable to combine NLG and template-based techniques. The issue of combining generation techniques can be seen in more abstract terms as the issue of mixing levels of representation of different degrees of linguistic abstraction. This paper aims at defining a reference architecture for systems using mixed representations. We argue that mixed representations can be used without abandoning a linguistically grounded approach to language generation.
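
    A minimal sketch of mixing representation levels (the mini-grammar and plan format are invented): fixed strings pass through untouched, while abstract items still receive linguistic processing, here a number-agreement decision.

        def realize(item):
            if isinstance(item, str):   # fixed template string: already surface text
                return item
            noun, count = item          # abstract item: needs realization
            return f"{count} {noun}{'s' if count != 1 else ''}"

        def generate(plan):
            return " ".join(realize(x) for x in plan).replace(" .", ".")

        plan = ["The query returned", ("document", 3), "in", ("second", 1), "."]
        print(generate(plan))  # "The query returned 3 documents in 1 second."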