
This Table is Different: A WordNet-Based Approach to Identifying References to Document Entities

Abstract

Writing intended to inform frequently contains references to document entities (DEs), a mixed class that includes orthographically structured items (e.g., illustrations, sections, lists) and discourse entities (arguments, suggestions, points). Such references are vital to the interpretation of documents, but they often eschew identifiers such as "Figure 1" for inexplicit phrases like "in this figure" or "from these premises". We examine inexplicit references to DEs, termed DE references, and recast the problem of their automatic detection into the determination of relevant word senses. We then show the feasibility of machine learning for the detection of DE-relevant word senses, using a corpus of human-labeled synsets from WordNet. We test cross-domain performance by gathering lemmas and synsets from three corpora: website privacy policies, Wikipedia articles, and Wikibooks textbooks. Identifying DE references will enable language technologies to use the information encoded by them, permitting the automatic generation of finely-tuned descriptions of DEs and the presentation of richly-structured information to readers.
