30,349 research outputs found

    Information Extraction in Illicit Domains

    Full text link
    Extracting useful entities and attribute values from illicit domains such as human trafficking is a challenging problem with the potential for widespread social impact. Such domains employ atypical language models, have `long tails' and suffer from the problem of concept drift. In this paper, we propose a lightweight, feature-agnostic Information Extraction (IE) paradigm specifically designed for such domains. Our approach uses raw, unlabeled text from an initial corpus, and a few (12-120) seed annotations per domain-specific attribute, to learn robust IE models for unobserved pages and websites. Empirically, we demonstrate that our approach can outperform feature-centric Conditional Random Field baselines by over 18\% F-Measure on five annotated sets of real-world human trafficking datasets in both low-supervision and high-supervision settings. We also show that our approach is demonstrably robust to concept drift, and can be efficiently bootstrapped even in a serial computing environment.Comment: 10 pages, ACM WWW 201

    Representation of probabilistic scientific knowledge

    Get PDF
    This article is available through the Brunel Open Access Publishing Fund. Copyright © 2013 Soldatova et al; licensee BioMed Central Ltd.The theory of probability is widely used in biomedical research for data analysis and modelling. In previous work the probabilities of the research hypotheses have been recorded as experimental metadata. The ontology HELO is designed to support probabilistic reasoning, and provides semantic descriptors for reporting on research that involves operations with probabilities. HELO explicitly links research statements such as hypotheses, models, laws, conclusions, etc. to the associated probabilities of these statements being true. HELO enables the explicit semantic representation and accurate recording of probabilities in hypotheses, as well as the inference methods used to generate and update those hypotheses. We demonstrate the utility of HELO on three worked examples: changes in the probability of the hypothesis that sirtuins regulate human life span; changes in the probability of hypotheses about gene functions in the S. cerevisiae aromatic amino acid pathway; and the use of active learning in drug design (quantitative structure activity relation learning), where a strategy for the selection of compounds with the highest probability of improving on the best known compound was used. HELO is open source and available at https://github.com/larisa-soldatova/HELO.This work was partially supported by grant BB/F008228/1 from the UK Biotechnology & Biological Sciences Research Council, from the European Commission under the FP7 Collaborative Programme, UNICELLSYS, KU Leuven GOA/08/008 and ERC Starting Grant 240186

    Ontologies and Information Extraction

    Full text link
    This report argues that, even in the simplest cases, IE is an ontology-driven process. It is not a mere text filtering method based on simple pattern matching and keywords, because the extracted pieces of texts are interpreted with respect to a predefined partial domain model. This report shows that depending on the nature and the depth of the interpretation to be done for extracting the information, more or less knowledge must be involved. This report is mainly illustrated in biology, a domain in which there are critical needs for content-based exploration of the scientific literature and which becomes a major application domain for IE
    corecore