Unsupervised instance selection from text streams

Authors: Bonin, Rafael; Marcacini, Ricardo M.; Rezende, Solange Oliveira
Publication venue: Porto Alegre

Instance selection techniques have received great attention in the literature, since they are useful for identifying a subset of instances (textual documents) that adequately represents the knowledge embedded in the entire text database. Most instance selection techniques are supervised, i.e., they require a labeled data set to define, with the help of classifiers, the separation boundaries of the data. However, manually labeling the instances requires intense human effort, which is impractical when dealing with text streams. In this article, we present an approach for unsupervised instance selection from text streams. In our approach, text clustering methods are used to define the separation boundaries, thereby separating regions of high data density. The most representative instances of each cluster, which are the centers of high-density regions, are selected to represent a portion of the data. A well-known algorithm for data sampling from streams, Reservoir Sampling, has been adapted to incorporate the unsupervised instance selection. We carried out an experimental evaluation using three benchmark text collections, and the reported results show that the proposed approach significantly increases the quality of a knowledge extraction task by using more representative instances.

Funding: FAPESP - São Paulo Research Foundation (grant 2010/20564-8); CAPES; CNPq

1st Symposium on Knowledge Discovery, Mining and Learning (KDMiLe). São Carlos, Brazil. 17-19 July 2013
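
The abstract names two building blocks: Reservoir Sampling over a stream, and selection of the instance nearest the center of a dense region. The abstract does not say how the authors combine them, so the following is only a minimal sketch of the two pieces in isolation, not the paper's method. It assumes the classic Reservoir Sampling scheme (Algorithm R), bag-of-words vectors with cosine similarity, and a group of documents that already stands in for one cluster. The function names `reservoir_sample`, `cosine`, and `most_central` are hypothetical.

```python
import math
import random
from collections import Counter


def reservoir_sample(stream, k, rng=None):
    """Classic Reservoir Sampling (Algorithm R): a uniform sample of up to
    k items from a stream of unknown length, using O(k) memory."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def most_central(docs):
    """Return the document most similar to the centroid of `docs`.

    Assumption: `docs` is a non-empty list of strings that already belong
    to one cluster; the clustering step itself is omitted here.
    """
    vectors = [Counter(d.lower().split()) for d in docs]
    centroid = Counter()
    for v in vectors:
        # A summed vector is enough: cosine similarity ignores scale.
        centroid.update(v)
    best = max(range(len(docs)), key=lambda i: cosine(vectors[i], centroid))
    return docs[best]
```

For example, `reservoir_sample(documents, 100)` keeps a uniform sample of 100 documents from an iterable `documents`, while `most_central(cluster_docs)` picks one representative from a group. An adaptation in the spirit of the abstract would presumably bias what enters the reservoir toward such representatives, but that policy is not specified in this record.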