The paper describes the use of Information
Extraction (IE), a Natural Language Processing (NLP)
technique, to assist ‘rich’ semantic indexing of diverse
archaeological text resources. Such unpublished online
documents are often referred to as ‘Grey Literature’.
Established document indexing techniques are not sufficient to
satisfy user information needs that extend beyond simple
term-matching search. The research focuses on semantically
aware ‘rich’ indexing of diverse natural language resources,
with properties that support information retrieval from online
publications and datasets associated with the Semantic
Technologies for Archaeological Resources (STAR) project in
the UoG Hypermedia Research Unit.
The study proposes the use of knowledge resources and
conceptual models to assist an Information Extraction process
able to provide ‘rich’ semantic indexing of archaeological
documents capable of resolving linguistic ambiguities of
indexed terms. CIDOC CRM-EH, a standard core ontology in
cultural heritage, and the English Heritage (EH) Thesauri for
archaeological concepts are employed to drive the Information
Extraction process and to support a semantic framework in
which indexed terms enable semantically aware access to
online resources. The paper
describes the process of semantic indexing of archaeological
concepts (periods and finds) in a corpus of 535 grey literature
documents using a rule-based Information Extraction
technique facilitated by the General Architecture for Text
Engineering (GATE) toolkit and expressed as Java Annotation
Patterns Engine (JAPE) rules. Illustrative examples
demonstrate the different stages of the process.
Initial results suggest that the combination of information
extraction with knowledge resources and standard core
conceptual models is capable of supporting semantically aware and
linguistically disambiguated term indexing.
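
The rule-based indexing of periods and finds described above can be
illustrated with a minimal sketch. It is written in Python rather than
JAPE and is not the pipeline used in the study. The term lists, the
concept identifiers (such as "period:roman") and the function names are
all hypothetical stand-ins for the EH Thesauri and ontology entities. The
sketch first looks up gazetteer terms, then applies one contextual rule:
a period term immediately followed by a find term.

```python
import re
from dataclasses import dataclass

# Hypothetical mini-gazetteers mapping terms to made-up concept identifiers.
# In the study, terms come from the English Heritage Thesauri.
PERIODS = {
    "roman": "period:roman",
    "medieval": "period:medieval",
    "iron age": "period:iron-age",
}
FINDS = {
    "pottery": "find:pottery",
    "coin": "find:coin",
    "brooch": "find:brooch",
}


@dataclass(frozen=True)
class Annotation:
    start: int    # character offset where the match begins
    end: int      # character offset just past the match
    text: str     # matched surface form
    kind: str     # "Period" or "Find"
    concept: str  # hypothetical concept identifier


def _lookup(text, gazetteer, kind):
    """Gazetteer step: find every term (with an optional plural 's')."""
    found = []
    for term, concept in gazetteer.items():
        pattern = r"\b" + re.escape(term) + r"s?\b"
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            found.append(Annotation(m.start(), m.end(), m.group(0), kind, concept))
    return found


def annotate(text):
    """Return all Period and Find annotations in document order."""
    anns = _lookup(text, PERIODS, "Period") + _lookup(text, FINDS, "Find")
    return sorted(anns, key=lambda a: a.start)


def period_find_pairs(text):
    """Rule step: a Period separated from a following Find by whitespace only."""
    anns = annotate(text)
    pairs = []
    for left, right in zip(anns, anns[1:]):
        gap = text[left.end:right.start]
        if left.kind == "Period" and right.kind == "Find" and gap.strip() == "":
            pairs.append((left.concept, right.concept))
    return pairs


if __name__ == "__main__":
    sentence = ("Two Roman coins and a sherd of medieval pottery "
                "were recovered from the Iron Age ditch.")
    for pair in period_find_pairs(sentence):
        print(pair)
```

Indexing against concept identifiers rather than surface strings is what
lets a later search treat "coins" and "coin" as the same find type. The
adjacency rule is one simple example of how context might narrow the
reading of an otherwise ambiguous term.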