1 research outputs found
Bootstrapping Named Entity Annotation by Means of Active Machine Learning: A Method for Creating Corpora
This thesis describes the development and in-depth empirical investigation of a
method, called BootMark, for bootstrapping the marking up of named entities
in textual documents. The reason for working with documents, as opposed to
for instance sentences or phrases, is that the BootMark method is concerned
with the creation of corpora. The claim made in the thesis is that BootMark
requires a human annotator to manually annotate fewer documents in order to
produce a named entity recognizer with a given performance, than would be
needed if the documents forming the basis for the recognizer were randomly
drawn from the same corpus. The intention is then to use the created named en-
tity recognizer as a pre-tagger and thus eventually turn the manual annotation
process into one in which the annotator reviews system-suggested annotations
rather than creating new ones from scratch. The BootMark method consists of
three phases: (1) Manual annotation of a set of documents; (2) Bootstrapping
– active machine learning for the purpose of selecting which document to an-
notate next; (3) The remaining unannotated documents of the original corpus
are marked up using pre-tagging with revision.
Five emerging issues are identified, described and empirically investigated
in the thesis. Their common denominator is that they all depend on the real-
ization of the named entity recognition task, and as such, require the context
of a practical setting in order to be properly addressed. The emerging issues
are related to: (1) the characteristics of the named entity recognition task and
the base learners used in conjunction with it; (2) the constitution of the set of
documents annotated by the human annotator in phase one in order to start the
bootstrapping process; (3) the active selection of the documents to annotate in
phase two; (4) the monitoring and termination of the active learning carried out
in phase two, including a new intrinsic stopping criterion for committee-based
active learning; and (5) the applicability of the named entity recognizer created
during phase two as a pre-tagger in phase three.
The outcomes of the empirical investigations concerning the emerging is-
sues support the claim made in the thesis. The results also suggest that while
the recognizer produced in phases one and two is as useful for pre-tagging as
a recognizer created from randomly selected documents, the applicability of
the recognizer as a pre-tagger is best investigated by conducting a user study
involving real annotators working on a real named entity recognition task