1,580 research outputs found

    Relation Extraction Using Convolution Tree Kernel Expanded with Entity Features

    Get PDF
    PACLIC 21 / Seoul National University, Seoul, Korea / November 1-3, 200

    Survey on Kernel-Based Relation Extraction

    Get PDF

    Document Layout Analysis and Recognition Systems

    Get PDF
    Automatic extraction of relevant knowledge to domain-specific questions from Optical Character Recognition (OCR) documents is critical for developing intelligent systems, such as document search engines, sentiment analysis, and information retrieval, since hands-on knowledge extraction by a domain expert with a large volume of documents is intensive, unscalable, and time-consuming. There have been a number of studies that have automatically extracted relevant knowledge from OCR documents, such as ABBY and Sandford Natural Language Processing (NLP). Despite the progress, there are still limitations yet-to-be solved. For instance, NLP often fails to analyze a large document. In this thesis, we propose a knowledge extraction framework, which takes domain-specific questions as input and provides the most relevant sentence/paragraph to the given questions in the document. Overall, our proposed framework has two phases. First, an OCR document is reconstructed into a semi-structured document (a document with hierarchical structure of (sub)sections and paragraphs). Then, relevant sentence/paragraph for a given question is identified from the reconstructed semi structured document. Specifically, we proposed (1) a method that converts an OCR document into a semi structured document using text attributes such as font size, font height, and boldface (in Chapter 2), (2) an image-based machine learning method that extracts Table of Contents (TOC) to provide an overall structure of the document (in Chapter 3), (3) a document texture-based deep learning method (DoT-Net) that classifies types of blocks such as text, image, and table (in Chapter 4), and (4) a Question & Answer (Q&A) system that retrieves most relevant sentence/paragraph for a domain-specific question. A large number of document intelligent systems can benefit from our proposed automatic knowledge extraction system to construct a Q&A system for OCR documents. Our Q&A system has applied to extract domain specific information from business contracts at GE Power

    Crowdsourcing Cybersecurity: Cyber Attack Detection using Social Media

    Full text link
    Social media is often viewed as a sensor into various societal events such as disease outbreaks, protests, and elections. We describe the use of social media as a crowdsourced sensor to gain insight into ongoing cyber-attacks. Our approach detects a broad range of cyber-attacks (e.g., distributed denial of service (DDOS) attacks, data breaches, and account hijacking) in an unsupervised manner using just a limited fixed set of seed event triggers. A new query expansion strategy based on convolutional kernels and dependency parses helps model reporting structure and aids in identifying key event characteristics. Through a large-scale analysis over Twitter, we demonstrate that our approach consistently identifies and encodes events, outperforming existing methods.Comment: 13 single column pages, 5 figures, submitted to KDD 201

    PPI-IRO: A two-stage method for protein-protein interaction extraction based on interaction relation ontology

    Full text link
    Mining Protein-Protein Interactions (PPIs) from the fast-growing biomedical literature resources has been proven as an effective approach for the identifi cation of biological regulatory networks. This paper presents a novel method based on the idea of Interaction Relation Ontology (IRO), which specifi es and organises words of various proteins interaction relationships. Our method is a two-stage PPI extraction method. At fi rst, IRO is applied in a binary classifi er to determine whether sentences contain a relation or not. Then, IRO is taken to guide PPI extraction by building sentence dependency parse tree. Comprehensive and quantitative evaluations and detailed analyses are used to demonstrate the signifi cant performance of IRO on relation sentences classifi cation and PPI extraction. Our PPI extraction method yielded a recall of around 80% and 90% and an F1 of around 54% and 66% on corpora of AIMed and Bioinfer, respectively, which are superior to most existing extraction methods. Copyright © 2014 Inderscience Enterprises Ltd
    • …
    corecore