Abstract. Annotating documents with data extracted from their text yields a richer document representation, which users can exploit for tasks such as search and browsing. However, data extraction is hard, especially in large-scale heterogeneous settings. Entity linking is a more focused technique: rather than extracting new data from documents, it links words in documents to referent entities in existing structured datasets or ontologies. We follow this direction in that, given some documents, we also aim to find entities (ontology concepts in our case) that can be used to create document annotations. However, we emphasize the role of users in the annotation process: the concepts we search for are not annotations themselves but candidate annotations, which together with their contexts form annotation modules that users then employ to create annotations manually. We propose a technique that efficiently computes annotation candidates based on a coarse-grained, topic-based representation of documents and ontology concepts. Aiming to maximize compactness while preserving useful information, we also elaborate on a module extraction technique that considers only annotation candidates and context elements that are “on-topic”, i.e., share topics with the documents to be annotated. Initial experiments show promising results as well as the need for much more research in this new direction of ontology-based annotation support.
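The "on-topic" idea described above can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes that documents and ontology concepts are each represented as a plain set of topic labels, that a concept is a candidate when it shares at least one topic with the document, and that a concept's context is its set of directly related concepts. The function names, the sample ontology, and the topic labels are all hypothetical.

```python
from typing import Dict, List, Set


def annotation_candidates(doc_topics: Set[str],
                          concept_topics: Dict[str, Set[str]]) -> List[str]:
    """Return concepts sharing at least one topic with the document (assumed criterion)."""
    return sorted(c for c, topics in concept_topics.items() if topics & doc_topics)


def extract_module(doc_topics: Set[str],
                   concept_topics: Dict[str, Set[str]],
                   neighbors: Dict[str, Set[str]]) -> Dict[str, List[str]]:
    """Map each candidate to its on-topic context, dropping off-topic neighbors."""
    module = {}
    for concept in annotation_candidates(doc_topics, concept_topics):
        context = neighbors.get(concept, set())
        module[concept] = sorted(
            n for n in context if concept_topics.get(n, set()) & doc_topics
        )
    return module


# Hypothetical toy data: topic sets for one document and a small ontology.
doc_topics = {"medicine", "genetics"}
concept_topics = {
    "Gene": {"genetics"},
    "Protein": {"genetics", "chemistry"},
    "Hospital": {"medicine"},
    "Galaxy": {"astronomy"},
    "Telescope": {"astronomy"},
    "Solvent": {"chemistry"},
}
# Hypothetical "related concept" links standing in for ontology context.
neighbors = {
    "Gene": {"Protein", "Galaxy"},
    "Protein": {"Gene", "Solvent"},
    "Hospital": set(),
    "Galaxy": {"Telescope"},
}

candidates = annotation_candidates(doc_topics, concept_topics)
module = extract_module(doc_topics, concept_topics, neighbors)
print(candidates)
print(module)
```

In this toy setting the module stays compact because off-topic concepts such as Galaxy and Solvent are excluded both as candidates and as context elements. A user would then choose the final annotations from the remaining candidates.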