Information Extraction in Illicit Domains
Extracting useful entities and attribute values from illicit domains such as
human trafficking is a challenging problem with the potential for widespread
social impact. Such domains employ atypical language models, have "long tails"
and suffer from the problem of concept drift. In this paper, we propose a
lightweight, feature-agnostic Information Extraction (IE) paradigm specifically
designed for such domains. Our approach uses raw, unlabeled text from an
initial corpus, and a few (12-120) seed annotations per domain-specific
attribute, to learn robust IE models for unobserved pages and websites.
Empirically, we demonstrate that our approach can outperform feature-centric
Conditional Random Field baselines by over 18% F-Measure on five annotated
sets of real-world human trafficking datasets in both low-supervision and
high-supervision settings. We also show that our approach is demonstrably
robust to concept drift, and can be efficiently bootstrapped even in a serial
computing environment.
Comment: 10 pages, ACM WWW 201
Using Contexts and Constraints for Improved Geotagging of Human Trafficking Webpages
Extracting geographical tags from webpages is a well-motivated application in
many domains. In illicit domains with unusual language models, like human
trafficking, extracting geotags with both high precision and recall is a
challenging problem. In this paper, we describe a geotag extraction framework
in which context, constraints and the openly available Geonames knowledge base
work in tandem in an Integer Linear Programming (ILP) model to achieve good
performance. In preliminary empirical investigations, the framework improves
precision by 28.57% and F-measure by 36.9% on a difficult human trafficking
geotagging task compared to a machine learning-based baseline. The method is
already being integrated into an existing knowledge base construction system
widely used by US law enforcement agencies to combat human trafficking.
Comment: 6 pages, GeoRich 2017 workshop at ACM SIGMOD conference