
    Semantic preserving text representation and its applications in text clustering

    Text mining using the vector space representation has proven to be a valuable tool for classification, prediction, information retrieval and extraction. The nature of text data presents several issues for these tasks, including high dimensionality and the presence of special word types such as polysemous and synonymous words. A variety of techniques have been devised to overcome these shortcomings, including feature selection and word sense disambiguation. Privacy preserving data mining is also an area of emerging interest. Existing techniques for privacy preserving data mining require the use of secure computation protocols, which often incur a greatly increased computational cost. In this paper, a generalization-based method is presented for creating a semantic-preserving vector space model (SPVSM) which reduces dimensionality and addresses the problems with special word types. The SPVSM also allows private text data to be safely represented without degrading clustering accuracy or performance. Further, the resulting representation can be used in combination with related techniques such as latent semantic indexing. The performance of text clustering using the semantic-preserving generalization method is evaluated against existing feature selection techniques and shown to have significant merit from a clustering perspective.
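    As a rough illustration of the generalization idea (not the paper's actual SPVSM), the sketch below collapses synonymous terms into shared concept tokens before building bag-of-words vectors, which both reduces the vocabulary size and merges synonyms; the concept_map table is a hypothetical hand-made example.

        # Minimal sketch: generalizing synonymous terms to shared concept
        # tokens before building a vector space model.
        from collections import Counter

        concept_map = {          # hypothetical generalization table
            "car": "vehicle", "automobile": "vehicle", "truck": "vehicle",
            "doctor": "physician", "physician": "physician",
        }

        def generalize(tokens):
            """Map each token to its more general concept when one is known."""
            return [concept_map.get(t, t) for t in tokens]

        def to_vector(tokens, vocab):
            """Bag-of-concepts count vector over a fixed vocabulary."""
            counts = Counter(generalize(tokens))
            return [counts[w] for w in vocab]

        docs = [["car", "crash"], ["automobile", "insurance"], ["doctor", "visit"]]
        vocab = sorted({w for d in docs for w in generalize(d)})
        vectors = [to_vector(d, vocab) for d in docs]
        print(vocab)    # synonyms collapse: 'car'/'automobile' -> 'vehicle'
        print(vectors)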

    Multi-mode partitioning for text clustering to reduce dimensionality and noises

    Co-clustering in text mining has been proposed to partition words and documents simultaneously. Although the main advantage of this approach is improved interpretability of the clusters, few such methods have been proposed, and one-way partitioning is still widely used in information retrieval. In contrast to structured information, textual data suffer from high dimensionality and sparse matrices, so it is strictly necessary to pre-process texts before applying clustering techniques. In this paper, we propose a new procedure to reduce the high dimensionality of corpora and to remove noise from the unstructured data. We test two different pre-processing treatments of the data using two co-clustering algorithms; based on the results, we present the procedure that provides the best interpretation of the data.
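    The abstract does not name the two algorithms; as an illustrative stand-in, the sketch below co-clusters documents and terms with scikit-learn's SpectralCoclustering on a TF-IDF matrix, assigning a cluster label to every row (document) and every column (term) simultaneously.

        # Illustrative sketch, not the authors' procedure: simultaneous
        # partitioning of documents and terms on a TF-IDF matrix.
        from sklearn.cluster import SpectralCoclustering
        from sklearn.feature_extraction.text import TfidfVectorizer

        docs = [
            "stocks fell as markets reacted to inflation data",
            "the central bank raised interest rates again",
            "the team won the championship after extra time",
            "the striker scored twice in the final match",
        ]

        X = TfidfVectorizer(stop_words="english").fit_transform(docs)
        model = SpectralCoclustering(n_clusters=2, random_state=0).fit(X)

        print(model.row_labels_)      # document cluster per row
        print(model.column_labels_)   # term cluster per column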

    Efficient Information Theoretic Clustering on Discrete Lattices

    We consider the problem of clustering data that reside on discrete, low-dimensional lattices. Canonical examples of this setting are found in image segmentation and key point extraction. Our solution is based on a recent approach to information-theoretic clustering where clusters result from an iterative procedure that minimizes a divergence measure. We replace costly processing steps in the original algorithm by means of convolutions, which allow for highly efficient implementations and thus significantly reduce runtime. This paper therefore bridges a gap between machine learning and signal processing.
    Comment: This paper has been presented at the workshop LWA 201
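    A minimal sketch of the convolution trick alone (not the paper's full clustering algorithm): on a regular lattice, the kernel density estimate at every site can be obtained with a single convolution of the occupancy grid, rather than a costly per-point sum over all other points.

        # Density estimation on a lattice via one convolution.
        import numpy as np
        from scipy.ndimage import gaussian_filter

        rng = np.random.default_rng(0)
        grid = np.zeros((64, 64))
        # scatter two point clouds onto the lattice as an occupancy grid
        for cx, cy in [(16, 16), (48, 48)]:
            pts = rng.normal((cx, cy), 3.0, size=(200, 2)).astype(int).clip(0, 63)
            np.add.at(grid, (pts[:, 0], pts[:, 1]), 1.0)

        # one Gaussian convolution yields the kernel density estimate at
        # every lattice site at once, instead of looping over points
        density = gaussian_filter(grid, sigma=2.0)
        print(density.shape, density.sum())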

    Deep Divergence-Based Approach to Clustering

    A promising direction in deep learning research is to learn representations and simultaneously discover cluster structure in unlabeled data by optimizing a discriminative loss function. Compared to supervised deep learning, this line of research is in its infancy, and how to design and optimize suitable loss functions to train deep neural networks for clustering is still an open question. Our contribution to this emerging field is a new deep clustering network that leverages the discriminative power of information-theoretic divergence measures, which have been shown to be effective in traditional clustering. We propose a novel loss function that incorporates geometric regularization constraints, thus avoiding degenerate structures in the resulting clustering partition. Experiments on synthetic benchmarks and real datasets show that the proposed network achieves competitive performance with respect to other state-of-the-art methods, scales well to large datasets, and does not require pre-training steps.
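    The paper's exact divergence and geometric regularizer are not given in the abstract; the PyTorch sketch below shows only the general recipe with a common information-theoretic surrogate objective: a small network outputs soft cluster assignments, trained to be confident per sample, while a balance term guards against the degenerate all-one-cluster solution.

        # Hedged sketch of the general recipe, not the paper's exact loss.
        import torch
        import torch.nn as nn

        class ClusterNet(nn.Module):
            def __init__(self, in_dim=10, n_clusters=3):
                super().__init__()
                self.net = nn.Sequential(
                    nn.Linear(in_dim, 32), nn.ReLU(),
                    nn.Linear(32, n_clusters),
                )

            def forward(self, x):
                return torch.softmax(self.net(x), dim=1)  # soft assignments

        def clustering_loss(p, eps=1e-8):
            # per-sample entropy: low when assignments are confident
            h_cond = -(p * (p + eps).log()).sum(1).mean()
            # entropy of the mean assignment: high when clusters are balanced
            m = p.mean(0)
            h_marg = -(m * (m + eps).log()).sum()
            return h_cond - h_marg   # minimize => confident and balanced

        model = ClusterNet()
        opt = torch.optim.Adam(model.parameters(), lr=1e-3)
        x = torch.randn(256, 10)     # stand-in for real input features
        for _ in range(100):
            opt.zero_grad()
            loss = clustering_loss(model(x))
            loss.backward()
            opt.step()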