
Hierarchical word clustering - automatic thesaurus generation

By V J Hodge and J Austin


In this paper, we propose a hierarchical, lexical clustering neural network algorithm that automatically generates a thesaurus (synonym abstraction) using purely stochastic information derived from unstructured text corpora and requiring no prior word classifications. The lexical hierarchy overcomes the Vocabulary Problem by accommodating paraphrasing through synonym clusters, and overcomes Information Overload by focusing search within cohesive clusters. We describe existing word categorisation methodologies, identifying their respective strengths and weaknesses, and evaluate our proposed approach against an existing neural approach, using a benchmark statistical approach and a human-generated thesaurus for comparison. We also evaluate our word context vector generation methodology against two similar approaches to investigate how word vector dimensionality and the number of words in the context window affect the quality of the word clusters produced. We demonstrate the effectiveness of our approach and its superiority to existing techniques. (C) 2002 Elsevier Science B.V. All rights reserved.
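As a rough illustration of what a "context window" means for word clustering, the sketch below builds simple co-occurrence count vectors and compares words by cosine similarity. It is not the authors' algorithm: the abstract does not specify how their word context vectors are generated or how the neural hierarchy is grown, so the counting scheme, the function names `context_vectors` and `cosine`, the window size, and the toy sentence are all assumptions made purely for illustration.

```python
from collections import Counter, defaultdict
from math import sqrt


def context_vectors(tokens, window=2):
    """Map each word to counts of the words seen within `window` positions of it.

    Hypothetical helper: a plain co-occurrence count, not the paper's method.
    """
    vectors = defaultdict(Counter)
    for i, word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # skip the target word itself
                vectors[word][tokens[j]] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Toy corpus (an assumption for illustration only).
tokens = "the cat sat on the mat the dog sat on the rug".split()
vectors = context_vectors(tokens, window=2)

# Words that appear in similar contexts get more similar vectors.
sim_cat_dog = cosine(vectors["cat"], vectors["dog"])
sim_cat_on = cosine(vectors["cat"], vectors["on"])
```

On this toy input, "cat" and "dog" score as more similar than "cat" and "on", because they share more of their surrounding words. Changing `window` changes which neighbours count as context, which is the kind of effect on cluster quality the abstract says the paper evaluates.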

Year: 2002
DOI identifier: 10.1016/S0925-2312(01)00675-0
OAI identifier: oai:eprints.whiterose.ac.uk:882
