ILK, Computational Linguistics

By Antal Van Den Bosch, Ton Weijters, H. Jaap Van Den Herik and Walter Daelemans


Machine learning is becoming recognised as a source of generic and powerful tools for tasks studied and implemented in language technology. Lazy learning with information-theoretic similarity matching has emerged as a salient approach, demonstrated to be superior to other machine-learning approaches in various comparative studies. It is asserted both in theoretical machine learning and in reports on applications of machine learning to natural language that the success of lazy learning may be due to the fact that language data contain small disjuncts, i.e., small clusters of identically-classified instances. We propose three measures to discover small disjuncts in our data: (i) we count and analyse indexed clusters of instances in induced decision trees; (ii) we count clusters of friendly (identically-classified) instances immediately surrounding instances, using similarity metrics from lazy learning; (iii) we compare average sizes of friendly-instance clusters under different similarity metrics. The measures are illustrated on a sample language task, viz. word pronunciation. Two conclusions are arrived at: (i) our data indeed contain large numbers of small disjuncts of roughly three to a hundred instances each, and (ii) there are important differences in feature relevance in the data.
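Measure (ii) above can be sketched in a few lines. The following is a minimal, hypothetical illustration only — the toy data, feature windows, and tie-handling policy are assumptions, not taken from the paper — showing how one might count the friendly (identically-classified) instances among an instance's nearest neighbours under a simple feature-overlap similarity, as used in lazy (memory-based) learning:

```python
def overlap_similarity(a, b):
    """Count the feature positions on which two instances agree."""
    return sum(1 for x, y in zip(a, b) if x == y)

def friendly_cluster_size(instance, label, data):
    """Among the maximally similar neighbours of `instance` (itself
    excluded), count those sharing its class label."""
    scored = [(overlap_similarity(instance, feats), lab)
              for feats, lab in data if feats != instance]
    best = max(sim for sim, _ in scored)
    nearest = [lab for sim, lab in scored if sim == best]
    return sum(1 for lab in nearest if lab == label)

# Toy word-pronunciation-style instances: (letter-window features, phoneme class).
# These examples are invented for illustration.
data = [
    (("c", "a", "t"), "/k/"),
    (("c", "a", "r"), "/k/"),
    (("c", "e", "l"), "/s/"),
    (("b", "a", "t"), "/b/"),
]

sizes = [friendly_cluster_size(feats, lab, data) for feats, lab in data]
print(sizes)  # friendly-cluster size per instance
```

Averaging such per-instance cluster sizes, and repeating the count under different similarity metrics (e.g. with information-theoretic feature weighting instead of plain overlap), would correspond to measure (iii).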

Year: 2014
OAI identifier: oai:CiteSeerX.psu:
Provided by: CiteSeerX