30,349 research outputs found
Information Extraction in Illicit Domains
Extracting useful entities and attribute values from illicit domains such as
human trafficking is a challenging problem with the potential for widespread
social impact. Such domains employ atypical language models, have `long tails'
and suffer from the problem of concept drift. In this paper, we propose a
lightweight, feature-agnostic Information Extraction (IE) paradigm specifically
designed for such domains. Our approach uses raw, unlabeled text from an
initial corpus, and a few (12-120) seed annotations per domain-specific
attribute, to learn robust IE models for unobserved pages and websites.
Empirically, we demonstrate that our approach can outperform feature-centric
Conditional Random Field baselines by over 18\% F-Measure on five annotated
sets of real-world human trafficking datasets in both low-supervision and
high-supervision settings. We also show that our approach is demonstrably
robust to concept drift, and can be efficiently bootstrapped even in a serial
computing environment.Comment: 10 pages, ACM WWW 201
Representation of probabilistic scientific knowledge
This article is available through the Brunel Open Access Publishing Fund. Copyright © 2013 Soldatova et al; licensee BioMed Central Ltd.The theory of probability is widely used in biomedical research for data analysis and modelling. In previous work the probabilities of the research hypotheses have been recorded as experimental metadata. The ontology HELO is designed to support probabilistic reasoning, and provides semantic descriptors for reporting on research that involves operations with probabilities. HELO explicitly links research statements such as hypotheses, models, laws, conclusions, etc. to the associated probabilities of these statements being true. HELO enables the explicit semantic representation and accurate recording of probabilities in hypotheses, as well as the inference methods used to generate and update those hypotheses. We demonstrate the utility of HELO on three worked examples: changes in the probability of the hypothesis that sirtuins regulate human life span; changes in the probability of hypotheses about gene functions in the S. cerevisiae aromatic amino acid pathway; and the use of active learning in drug design (quantitative structure activity relation learning), where a strategy for the selection of compounds with the highest probability of improving on the best known compound was used. HELO is open source and available at https://github.com/larisa-soldatova/HELO.This work was partially supported by grant BB/F008228/1 from the UK Biotechnology & Biological Sciences Research Council, from the European Commission under the FP7 Collaborative Programme, UNICELLSYS, KU Leuven GOA/08/008 and ERC Starting Grant 240186
Ontologies and Information Extraction
This report argues that, even in the simplest cases, IE is an ontology-driven
process. It is not a mere text filtering method based on simple pattern
matching and keywords, because the extracted pieces of texts are interpreted
with respect to a predefined partial domain model. This report shows that
depending on the nature and the depth of the interpretation to be done for
extracting the information, more or less knowledge must be involved. This
report is mainly illustrated in biology, a domain in which there are critical
needs for content-based exploration of the scientific literature and which
becomes a major application domain for IE
- …