ILK, Computational Linguistics

By Antal Van Den Bosch, Ton Weijters, H. Jaap Van Den Herik and Walter Daelemans


Machine learning is becoming recognised as a source of generic and powerful tools for tasks studied and implemented in language technology. Lazy learning with information-theoretic similarity matching has emerged as a salient approach, demonstrated to be superior to other machine-learning approaches in various comparative studies. It is asserted both in theoretical machine learning and in reports on applications of machine learning to natural language that the success of lazy learning may be due to the fact that language data contain small disjuncts, i.e., small clusters of identically-classified instances. We propose three measures to discover small disjuncts in our data: (i) we count and analyse indexed clusters of instances in induced decision trees; (ii) we count clusters of friendly (identically-classified) instances immediately surrounding instances, using similarity metrics from lazy learning; (iii) we compare average sizes of friendly-instance clusters under different similarity metrics. The measures are illustrated on a sample language task, viz. word pronunciation. Two conclusions are arrived at: (i) our data indeed contain large numbers of small disjuncts of roughly three to a hundred instances each, and (ii) there are important differences in feature relevance in the data.
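Measure (ii) above can be sketched in a few lines. The following is a minimal, hypothetical illustration only — the toy data, feature windows, and tie-handling policy are assumptions, not taken from the paper — showing how one might count the friendly (identically-classified) instances among an instance's nearest neighbours under a simple feature-overlap similarity, as used in lazy (memory-based) learning:

```python
def overlap_similarity(a, b):
    """Count the feature positions on which two instances agree."""
    return sum(1 for x, y in zip(a, b) if x == y)

def friendly_cluster_size(instance, label, data):
    """Among the maximally similar neighbours of `instance` (itself
    excluded), count those sharing its class label."""
    scored = [(overlap_similarity(instance, feats), lab)
              for feats, lab in data if feats != instance]
    best = max(sim for sim, _ in scored)
    nearest = [lab for sim, lab in scored if sim == best]
    return sum(1 for lab in nearest if lab == label)

# Toy word-pronunciation-style instances: (letter-window features, phoneme class).
# These examples are invented for illustration.
data = [
    (("c", "a", "t"), "/k/"),
    (("c", "a", "r"), "/k/"),
    (("c", "e", "l"), "/s/"),
    (("b", "a", "t"), "/b/"),
]

sizes = [friendly_cluster_size(feats, lab, data) for feats, lab in data]
print(sizes)  # friendly-cluster size per instance
```

Averaging such per-instance cluster sizes, and repeating the count under different similarity metrics (e.g. with information-theoretic feature weighting instead of plain overlap), would correspond to measure (iii).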

Year: 2014
OAI identifier: oai:CiteSeerX.psu:
Provided by: CiteSeerX