I describe a system, Txt2ids, that uses a series of regular expressions to extract suggestions for ontology identifier names from English text and classify them as (i) class names, (ii) individual names, (iii) object property names, or (iv) data property names. Beyond its practical use as a tool in an ontology authoring system, it also functions as a theoretical model of the syntactic organisation of identifier names. Regular expressions were derived from part-of-speech patterns in identifier names in a corpus of over 500 ontologies. Since ontology identifier names have syntactic structures that differ from natural English, the regular expressions were adapted. Extracted phrases were post-processed to comply with the structure of OWL Simplified English. A system sanity test achieved acceptable results when comparing identifiers extracted by Txt2ids (from texts that had been automatically generated by an ontology verbaliser from a large corpus of ontologies) with the original identifiers from the same corpus. Txt2ids tends to generate more identifiers than were present in the original ontology; however, many of the additional ones seem reasonable suggestions. To assist in the design of a future system evaluation, a pilot study was conducted in which identifier names extracted by Txt2ids from short, expository texts compared favourably with those created by human users when building ontologies from the same texts. The system has been deployed in an ontology editor developed for the SWAT project.
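To make the general idea concrete, the following is a minimal sketch of matching regular expressions against part-of-speech sequences and classifying the matched phrases. It is not the Txt2ids implementation: the tag-to-letter mapping, the four patterns, and the function name `extract_identifiers` are all assumptions made for illustration, the input is assumed to be already tagged with Penn Treebank tags, and the post-processing for OWL Simplified English is omitted.

```python
import re

# Hypothetical mapping from Penn Treebank tags to one-letter codes, so that a
# tagged sentence becomes a string that a regular expression can scan.
TAG_CODES = {
    "DT": "D", "JJ": "J", "NN": "N", "NNS": "N", "NNP": "P", "NNPS": "P",
    "VB": "V", "VBZ": "V", "VBP": "V", "IN": "I", "CD": "C",
}

# Illustrative patterns only (assumptions, not the paper's expressions).
# They are tried in this order; earlier patterns claim their tokens first.
PATTERNS = [
    ("individual", re.compile(r"P+")),                     # proper nouns
    ("data property", re.compile(r"VN(?=C)")),             # verb + noun before a number
    ("object property", re.compile(r"VI?(?=D?[JNP])")),    # verb (+ preposition) before a noun phrase
    ("class", re.compile(r"J*N+")),                        # adjectives + common nouns
]


def extract_identifiers(tagged):
    """Return (phrase, category) pairs for a list of (word, tag) tokens."""
    # One code letter per token, so match offsets are token indices.
    codes = "".join(TAG_CODES.get(tag, "O") for _, tag in tagged)
    used = [False] * len(tagged)
    found = []
    for category, pattern in PATTERNS:
        for match in pattern.finditer(codes):
            span = range(match.start(), match.end())
            if any(used[i] for i in span):
                continue  # tokens already claimed by an earlier pattern
            for i in span:
                used[i] = True
            phrase = " ".join(tagged[i][0] for i in span)
            found.append((match.start(), phrase, category))
    found.sort()  # restore sentence order
    return [(phrase, category) for _, phrase, category in found]


sentence = [("Every", "DT"), ("large", "JJ"), ("dog", "NN"),
            ("chases", "VBZ"), ("a", "DT"), ("cat", "NN")]
print(extract_identifiers(sentence))
```

Encoding each tag as a single letter keeps match offsets aligned with token positions, which makes it straightforward to recover the original words for each suggested identifier.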