CSAL: Self-adaptive Labeling based Clustering Integrating Supervised Learning on Unlabeled Data

Author: Cao Longbing
Li Fangfang
Xu G
Publication date: 01/01/2015

Supervised classification approaches can predict labels for unseen data because they are trained on labeled examples, so their success depends heavily on the labeled training data. In contrast, clustering is effective at revealing the aggregation structure of unlabeled data, but the performance of most clustering methods is limited by the absence of labels. In real applications, however, obtaining labeled data is time-consuming and sometimes impossible. Combining clustering and classification is a promising and active research direction that can substantially improve performance. In this paper, we propose a clustering framework based on self-adaptive labeling (CSAL) that integrates clustering and classification on unlabeled data. Clustering is first employed to partition the data, and a certain proportion of the clustered data is selected by our proposed labeling approach for training classifiers. To refine the trained classifiers, an iterative process based on the Expectation-Maximization algorithm is incorporated into CSAL. Experiments are conducted on publicly available data sets to test different combinations of clustering algorithms and classification models, as well as various training data labeling methods. The experimental results show that our approach, together with the self-adaptive method, outperforms other methods.
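The cluster, label, train, and refine loop described in this abstract can be illustrated with a minimal sketch. It is not the authors' CSAL implementation: the abstract does not specify the self-adaptive labeling rule or the classifiers, so this sketch assumes a fixed `proportion` of points nearest each cluster mean as pseudo-labeled training data and uses a nearest-centroid classifier as a stand-in. All function names are hypothetical.

```python
def mean(points):
    dim = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dim))

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def select_confident(points, labels, proportion):
    """Per cluster, keep the given proportion of points closest to the cluster mean.
    Assumption: a fixed proportion stands in for the paper's self-adaptive rule."""
    selected = []
    for c in sorted(set(labels)):
        members = [p for p, lab in zip(points, labels) if lab == c]
        center = mean(members)
        members.sort(key=lambda p: sq_dist(p, center))
        keep = max(1, round(proportion * len(members)))
        selected.extend((p, c) for p in members[:keep])
    return selected

def train_classifier(training):
    """Stand-in classifier: one centroid per pseudo-label."""
    by_label = {}
    for p, c in training:
        by_label.setdefault(c, []).append(p)
    return {c: mean(ps) for c, ps in by_label.items()}

def predict(model, p):
    return min(model, key=lambda c: sq_dist(p, model[c]))

def cluster_then_classify(points, initial_labels, proportion=0.5, max_iter=20):
    """Alternate between selecting pseudo-labeled data and relabeling all points."""
    labels = list(initial_labels)
    for _ in range(max_iter):
        training = select_confident(points, labels, proportion)
        model = train_classifier(training)
        new_labels = [predict(model, p) for p in points]
        if new_labels == labels:  # stop once relabeling no longer changes anything
            break
        labels = new_labels
    return labels

# Toy 1-D data; the initial clustering wrongly puts 2.6 in cluster 0.
points = [(0.0,), (0.2,), (0.4,), (2.6,), (3.0,), (3.2,)]
initial_labels = [0, 0, 0, 0, 1, 1]
refined_labels = cluster_then_classify(points, initial_labels)
```

In this toy run the initial labels can come from any clustering algorithm; the loop only assumes that points near a cluster's mean are more likely to be labeled correctly than points far from it.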

arXiv.org e-Print Archive

OPUS - University of Technology Sydney

Fast Approximate $K$-Means via Cluster Closures

Author: Ke Qifa
Li Shipeng
Wang Jing
Wang Jingdong
Zeng Gang
Publication date: 01/01/2012

K-means, a simple and effective clustering algorithm, is one of the most widely used algorithms in the multimedia and computer vision communities. Traditional k-means is an iterative algorithm: in each iteration, new cluster centers are computed and each data point is re-assigned to its nearest center. The cluster re-assignment step becomes prohibitively expensive when the numbers of data points and cluster centers are large. In this paper, we propose a novel approximate k-means algorithm to greatly reduce the computational complexity of the assignment step. Our approach is motivated by the observation that most active points, those changing their cluster assignments at each iteration, are located on or near cluster boundaries. The idea is to efficiently identify those active points by pre-assembling the data into groups of neighboring points using multiple random spatial partition trees, and to use the neighborhood information to construct a closure for each cluster, in such a way that only a small number of cluster candidates need to be considered when assigning a data point to its nearest cluster. Using complexity analysis, image data clustering, and applications to image retrieval, we show that our approach outperforms state-of-the-art approximate k-means algorithms in terms of clustering quality and efficiency.
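For reference, the traditional iteration that this abstract takes as its starting point (recompute centers, then re-assign every point to its nearest center) can be sketched as follows. This is plain Lloyd-style k-means, not the paper's cluster-closure approximation; the function names and the choice to pass initial centers explicitly are assumptions made for illustration.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def assign(points, centers):
    """Assignment step: index of the nearest center for every point.
    This compares each point with every center, which is the costly part."""
    return [min(range(len(centers)), key=lambda c: squared_distance(p, centers[c]))
            for p in points]

def update(points, labels, centers):
    """Update step: move each center to the mean of its assigned points."""
    new_centers = []
    for c, old in enumerate(centers):
        members = [p for p, lab in zip(points, labels) if lab == c]
        if not members:
            new_centers.append(old)  # keep the center of an empty cluster unchanged
            continue
        new_centers.append(tuple(sum(m[d] for m in members) / len(members)
                                 for d in range(len(old))))
    return new_centers

def kmeans(points, initial_centers, max_iter=100):
    centers = list(initial_centers)
    labels = assign(points, centers)
    for _ in range(max_iter):
        centers = update(points, labels, centers)
        new_labels = assign(points, centers)
        if new_labels == labels:  # converged: no point changed its cluster
            break
        labels = new_labels
    return labels, centers

points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
labels, centers = kmeans(points, initial_centers=[points[0], points[3]])
```

The `assign` function performs one distance computation per point and center pair in every iteration, which is the cost the abstract describes as prohibitive for large data sets.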

arXiv.org e-Print Archive

Crossref

ACCAMS: Additive Co-Clustering to Approximate Matrices Succinctly

Author: Ahmed Amr
Beutel Alex
Smola Alexander J.
Publication date: 31/12/2014

Matrix completion and approximation are popular tools to capture a user's preferences for recommendation and to approximate missing data. Instead of using low-rank factorization, we take a drastically different approach, based on the simple insight that an additive model of co-clusterings allows one to approximate matrices efficiently. This allows us to build a concise model that, per bit of model learned, significantly beats all factorization approaches to matrix approximation. Even more surprisingly, we find that summing over small co-clusterings is more effective in modeling matrices than classic co-clustering, which uses just one large partitioning of the matrix. Occam's razor suggests that the simple structure induced by our model better captures the latent preferences and decision-making processes present in the real world than classic co-clustering or matrix factorization. We provide an iterative minimization algorithm, a collapsed Gibbs sampler, theoretical guarantees for matrix approximation, and excellent empirical evidence for the efficacy of our approach. We achieve state-of-the-art results on the Netflix problem with a fraction of the model complexity. Comment: 22 pages, under review for conference publication
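The idea of approximating a matrix by a sum of small co-clusterings can be illustrated with a minimal sketch. It is not the ACCAMS algorithm: the row and column groupings here are fixed by hand rather than learned, and each co-clustering is simply fitted with block means on the residual left by the previous ones. The function names and the toy matrix are hypothetical.

```python
def block_means(matrix, row_groups, col_groups):
    """Mean of the entries in each (row group, column group) block."""
    sums, counts = {}, {}
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            key = (row_groups[i], col_groups[j])
            sums[key] = sums.get(key, 0.0) + value
            counts[key] = counts.get(key, 0) + 1
    return {key: sums[key] / counts[key] for key in sums}

def fit_additive(matrix, coclusterings):
    """Fit co-clusterings one after another, each on the residual of the previous ones.
    Returns the final residual and the squared error after each co-clustering."""
    residual = [list(row) for row in matrix]
    errors = []
    for row_groups, col_groups in coclusterings:
        means = block_means(residual, row_groups, col_groups)
        for i, row in enumerate(residual):
            for j in range(len(row)):
                row[j] -= means[(row_groups[i], col_groups[j])]
        errors.append(sum(v * v for row in residual for v in row))
    return residual, errors

# Toy matrix built as the sum of two block-constant patterns.
matrix = [
    [5, 4, 2, 1],
    [4, 3, 1, 0],
    [3, 2, 4, 3],
    [2, 1, 3, 2],
]
# Assumed groupings: (row groups, column groups) for each co-clustering.
coclusterings = [
    ([0, 0, 1, 1], [0, 0, 1, 1]),
    ([0, 1, 0, 1], [0, 1, 0, 1]),
]
residual, errors = fit_additive(matrix, coclusterings)
```

Because block means minimize the squared error for a fixed partition, each added co-clustering can only keep or lower the error in `errors`; a single 2-by-2 co-clustering cannot represent this toy matrix exactly, while the sum of two can.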

arXiv.org e-Print Archive

CiteSeerX

A clustering algorithm based on fitness probability scores for cluster centers optimization

Author: Costa M. Fernanda P.
Fernandes Edite M. G. P.
Rocha Ana Maria A. C.
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2021

In the present paper, we propose an iterative clustering approach that sequentially applies five processes: assign, delete, split, delete, and optimization. It uses the fitness probability scores of the cluster centers to identify the least fitted centers, which then undergo an optimization process aimed at improving the centers from one iteration to the next. Moreover, the parameters of the algorithm for the delete, split, and optimization processes are dynamically tuned as problem-dependent functions. The presented clustering algorithm is evaluated using four data sets, two randomly generated and two well-known sets. The proposed algorithm is compared with other clustering algorithms through visualization of the clustering, the value of a validity measure, and the value of the objective function of the optimization process. The comparison of results shows that the proposed clustering algorithm is effective and robust. This work has been supported by FCT - Fundação para a Ciência e Tecnologia within the R&D Units Project Scope: UIDB/00013/2020 and UIDP/00013/2020 of CMATUM.

Universidade do Minho: RepositoriUM

A cluster-based simulation of facet-based search

Author: Gildea N.
Hopfgartner F.
Jose J.M.
Urruty T.
Villa R.
Publication venue: Association for Computing Machinery (ACM)
Publication date: 01/01/2008

The recent increase in online video has challenged research in the field of video information retrieval. Video search engines are becoming more and more interactive, helping users to easily find what they are looking for. In this poster, we present a new approach that uses an iterative clustering algorithm on text and visual features to simulate users creating new facets in a facet-based interface. Our experimental results demonstrate the usefulness of such an approach.

Crossref

Enlighten

CEIL: A General Classification-Enhanced Iterative Learning Framework for Text Clustering

Author: Ma Yinglong
Niu Di
Wang Mengzhen
Wu Haijiang
Zhao Mingjun
Publication venue: Association for Computing Machinery (ACM)
Publication date: 20/04/2023

Text clustering, one of the most fundamental challenges in unsupervised learning, aims at grouping semantically similar text segments without relying on human annotations. With the rapid development of deep learning, deep clustering has achieved significant advantages over traditional clustering methods. Despite their effectiveness, most existing deep text clustering methods rely heavily on representations pre-trained in general domains, which may not be the most suitable solution for clustering in specific target domains. To address this issue, we propose CEIL, a novel Classification-Enhanced Iterative Learning framework for short text clustering, which aims at generally improving clustering performance by introducing a classification objective to iteratively improve feature representations. In each iteration, we first adopt a language model to retrieve the initial text representations, from which the clustering results are collected using our proposed Category Disentangled Contrastive Clustering (CDCC) algorithm. After strict data filtering and aggregation processes, samples with clean category labels are retrieved, which serve as supervision information to update the language model with the classification objective via a prompt learning approach. Finally, the updated language model with improved representation ability is used to enhance clustering in the next iteration. Extensive experiments demonstrate that the CEIL framework significantly improves clustering performance over iterations and is generally effective on various clustering algorithms. Moreover, by incorporating CEIL on CDCC, we achieve state-of-the-art clustering performance on a wide range of short text clustering benchmarks, outperforming other strong baseline methods. Comment: The Web Conference 202

arXiv.org e-Print Archive