Location of Repository

Using word sense discrimination on historic document collections

By Nina Tahmasebi, Kai Niklas, Thomas Risse and Thomas Theuerkauf

Abstract

Word sense discrimination is the first, important step towards automatic detection of language evolution within large, historic document collections. By comparing the found word senses over time, we can reveal and use important information that will improve understanding and accessibility of a digital archive. Algorithms for word sense discrimination have been developed while keeping today’s language in mind and have thus been evaluated on well selected, modern datasets. The quality of the word senses found in the discrimination step has a large impact on the detection of language evolution. Therefore, as a first step, we verify that word sense discrimination can successfully be applied to digitized historic documents and that the results correctly correspond to word senses. Because accessibility of digitized historic collections is influenced also by the quality of the optical character recognition (OCR), as a second step we investigate the effects of OCR errors on word sense discrimination results. All evaluations in this paper are performed on The Times Archive, a collection of newspaper article

Topics: Error Impact, Historic Document Collections
Publisher: ACM
Year: 2010
OAI identifier: oai:CiteSeerX.psu:10.1.1.417.8932
Provided by: CiteSeerX
Download PDF:
Sorry, we are unable to provide the full text but you may find it at the following location(s):
  • http://citeseerx.ist.psu.edu/v... (external link)
  • http://liwa-project.eu/images/... (external link)
  • Suggested articles


    To submit an update or takedown request for this paper, please submit an Update/Correction/Removal Request.