222,862 research outputs found

    Geoscience-aware deep learning: A new paradigm for remote sensing

    Information extraction is a key activity for remote sensing images. A common distinction exists between knowledge-driven and data-driven methods. Knowledge-driven methods offer advanced reasoning ability and interpretability, but have difficulty handling complicated tasks, since prior knowledge is usually limited relative to the highly complex spatial patterns and geoscience phenomena found in reality. Data-driven models, especially those emerging in machine learning (ML) and deep learning (DL), have achieved substantial progress in geoscience and remote sensing applications. Although DL models have powerful feature learning and representation capabilities, traditional DL has inherent problems, including working as a black box and generally requiring a large amount of labeled training data. The focus of this paper is on methods that integrate domain knowledge, such as geoscience knowledge and geoscience features (GK/GFs), into the design of DL models. The paper introduces the new paradigm of geoscience-aware deep learning (GADL), in which GK/GFs and DL models are deeply combined to extract information from remote sensing data. It first provides a comprehensive summary of the GK/GFs used in GADL, which forms the basis for the subsequent integration of GK/GFs with DL models. This is followed by a taxonomy of approaches for integrating GK/GFs with DL models. Several approaches are detailed using illustrative examples. Challenges and research prospects in GADL are then discussed. Developing more novel and advanced methods in GADL is expected to become the prevailing trend in advancing remotely sensed information extraction in the future.

    Improvements in Information Extraction in Legal Text by Active Learning

    Managing licensing information and data rights is becoming a crucial issue in the Linked (Open) Data scenario. An open problem in this scenario is how to associate machine-readable license specifications with the data, so that automated approaches to handling such information can be exploited to avoid data misuse. This means that we need a way to automatically extract, from a natural language document specifying a certain license, a machine-readable description of the terms of use and reuse identified in that license. Ontology-based Information Extraction is crucial for translating natural language documents into Linked Data. This connection supports consumers in navigating documents and semantically related data. However, the performance of automated information extraction systems is far from perfect, and relies heavily on human intervention, either to create heuristics, to annotate examples for inferring models, or to interpret or validate patterns emerging from data. In this paper, we apply different Active Learning strategies to Information Extraction (IE) from licenses in English, a setting with highly repetitive text, few annotated or unannotated examples available, and a need for very high precision. We show that the most popular approach to active learning, i.e., uncertainty sampling for instance selection, does not perform well in this setting. We show that we can obtain an effect similar to that of density-based methods using uncertainty sampling, simply by reversing the ranking criterion and choosing the most certain instead of the most uncertain instances.
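
    The reversed ranking criterion described in the abstract above can be illustrated with a minimal sketch. It assumes a binary classifier that outputs a positive-class probability per unlabeled instance; the function name `rank_for_annotation`, the `most_certain` flag, and the probabilities are hypothetical and are not taken from the paper.

```python
def rank_for_annotation(probabilities, most_certain=False):
    """Order unlabeled instances for annotation (illustrative sketch).

    probabilities[i] is the predicted positive-class probability of
    instance i. Uncertainty is highest near 0.5 and lowest near 0 or 1.
    """
    uncertainty = [1.0 - 2.0 * abs(p - 0.5) for p in probabilities]
    # Standard uncertainty sampling puts the most uncertain first;
    # the reversed criterion puts the most certain first.
    return sorted(
        range(len(probabilities)),
        key=lambda i: uncertainty[i],
        reverse=not most_certain,
    )


# Hypothetical model outputs for four unlabeled instances.
probs = [0.95, 0.52, 0.10, 0.60]
uncertain_first = rank_for_annotation(probs)
certain_first = rank_for_annotation(probs, most_certain=True)
```

    In this sketch the two strategies differ only in sort direction, which is the sense in which the abstract speaks of "just reversing the ranking criterion".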

    Wrapper Maintenance: A Machine Learning Approach

    The proliferation of online information sources has led to an increased use of wrappers for extracting data from Web sources. While most previous research has focused on quick and efficient generation of wrappers, the development of tools for wrapper maintenance has received less attention. This is an important research problem because Web sources often change in ways that prevent the wrappers from extracting data correctly. We present an efficient algorithm that learns structural information about data from positive examples alone. We describe how this information can be used for two wrapper maintenance applications: wrapper verification and reinduction. The wrapper verification system detects when a wrapper is not extracting correct data, usually because the Web source has changed its format. The reinduction algorithm automatically recovers from changes in the Web source by identifying data on Web pages so that a new wrapper may be generated for this source. To validate our approach, we monitored 27 wrappers over a period of a year. The verification algorithm correctly discovered 35 of the 37 wrapper changes and made 16 mistakes, resulting in a precision of 0.73 and a recall of 0.95. We validated the reinduction algorithm on ten Web sources. We were able to successfully reinduce the wrappers, obtaining precision and recall values of 0.90 and 0.80 on the data extraction task.
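
    As a quick check of the reported recall, the sketch below applies the standard definitions of precision and recall. Only the figures 35 and 37 come from the abstract; the abstract does not say how the 16 mistakes split into false alarms and missed changes, so precision is given as a function only and is not recomputed here.

```python
def precision(true_positives, false_positives):
    """Fraction of reported changes that were real changes."""
    return true_positives / (true_positives + false_positives)


def recall(true_positives, false_negatives):
    """Fraction of real changes that were detected."""
    return true_positives / (true_positives + false_negatives)


# Figures stated in the abstract: 35 of 37 wrapper changes detected.
detected_changes = 35
total_changes = 37
missed_changes = total_changes - detected_changes

verification_recall = recall(detected_changes, missed_changes)
print(round(verification_recall, 2))
```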

    Knowledge Base Population using Semantic Label Propagation

    A crucial aspect of a knowledge base population system that extracts new facts from text corpora is the generation of training data for its relation extractors. In this paper, we present a method that maximizes the effectiveness of newly trained relation extractors at a minimal annotation cost. Manual labeling can be significantly reduced by Distant Supervision, a method for constructing training data automatically by aligning a large text corpus with an existing knowledge base of known facts. For example, all sentences mentioning both 'Barack Obama' and 'US' may serve as positive training instances for the relation born_in(subject,object). However, distant supervision typically results in a highly noisy training set: many training sentences do not really express the intended relation. We propose to combine distant supervision with minimal manual supervision in a technique called feature labeling, to eliminate noise from the large and noisy initial training set, resulting in a significant increase in precision. We further improve on this approach by introducing the Semantic Label Propagation method, which uses the similarity between low-dimensional representations of candidate training instances to extend the training set, increasing recall while maintaining high precision. Our proposed strategy for generating training data is studied and evaluated on an established test collection designed for knowledge base population tasks. The experimental results show that the Semantic Label Propagation strategy leads to substantial performance gains when compared to existing approaches, while requiring an almost negligible manual annotation effort. Comment: Submitted to Knowledge Based Systems, special issue on Knowledge Bases for Natural Language Processing.
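
    The distant supervision step described in the abstract above can be sketched as simple string matching between a corpus and a list of known facts. This is a minimal illustration under stated assumptions, not the authors' system: the function name `distant_supervision` and the example sentences are hypothetical, and real systems would use entity linking rather than substring tests.

```python
def distant_supervision(sentences, known_facts, relation):
    """Label each sentence that mentions both entities of a known fact
    as a positive training instance for the given relation."""
    examples = []
    for sentence in sentences:
        for subject, obj in known_facts:
            # Naive substring matching stands in for entity linking.
            if subject in sentence and obj in sentence:
                examples.append((sentence, subject, obj, relation))
    return examples


# One known fact from the knowledge base, as in the abstract's example.
facts = [("Barack Obama", "US")]

# Hypothetical corpus sentences.
corpus = [
    "Barack Obama was born in Honolulu, in the US.",
    "Barack Obama returned to the US after the summit.",  # noisy match
    "The summit took place in Paris.",
]

training = distant_supervision(corpus, facts, "born_in")
```

    The second sentence is labeled positive even though it does not express born_in, which is the kind of noise the abstract says feature labeling is meant to remove.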

    A Machine Learning Approach For Opinion Holder Extraction In Arabic Language

    Opinion mining aims at extracting useful subjective information from reliable amounts of text. Opinion holder recognition is a task that has not yet been considered for the Arabic language. This task essentially requires a deep understanding of clause structures. Unfortunately, the lack of a robust, publicly available Arabic parser further complicates the research. This paper presents pioneering research on opinion holder extraction in Arabic news that is independent of any lexical parser. We investigate constructing a comprehensive feature set to compensate for the lack of structural parsing output. The proposed feature set is adapted from previous work on English and coupled with our proposed semantic field and named entity features. Our feature analysis is based on Conditional Random Fields (CRF) and semi-supervised pattern recognition techniques. Different research models are evaluated via cross-validation experiments, achieving an F-measure of 54.03. We publicly release our corpus and lexicon to the opinion mining community to encourage further research.

    Information Extraction, Data Integration, and Uncertain Data Management: The State of The Art

    Information extraction, data integration, and uncertain data management are different areas of research that have received considerable attention in the last two decades. Many studies have tackled these areas individually. However, information extraction systems should be integrated with data integration methods to make use of the extracted information. Handling uncertainty in the extraction and integration process is an important issue for enhancing the quality of the data in such integrated systems. This article presents the state of the art in these areas of research, identifies their common ground, and shows how to integrate information extraction and data integration under the umbrella of uncertainty management.

    Automatic extraction of paraphrastic phrases from medium size corpora

    This paper presents a versatile system intended to acquire paraphrastic phrases from a representative corpus. In order to decrease the time spent on developing resources for NLP systems (for example Information Extraction, IE hereafter), we suggest using a machine learning system that helps define new templates and associated resources. This knowledge is automatically derived from the text collection, in interaction with a large semantic network.