Scholars in interdisciplinary fields like the
Digital Humanities are increasingly interested
in semantic annotation of specialized corpora.
Yet, under-resourced languages, imperfect or
noisily structured data, and user-specific classification tasks make it difficult to meet their
needs using off-the-shelf models. Manual annotation of large corpora from scratch, meanwhile, can be prohibitively expensive. Thus,
we propose an active learning solution for
named entity recognition, attempting to maximize a custom model’s improvement per additional unit of manual annotation. Our system
robustly handles any domain or user-defined
label set and requires no external resources,
enabling high-quality named entity recognition for
Humanities corpora where such resources are
not available. Evaluating on typologically disparate languages and datasets, we reduce required annotation by 20-60% and greatly outperform a competitive active learning baseline.

New York University–Paris Sciences Lettres Global Alliance grant; National Endowment for the Humanities grant, award HAA-256078-17; Computational Approaches to Modeling Language lab at New York University Abu Dhabi