18 research outputs found
Ultra-Scalable Spectral Clustering and Ensemble Clustering
This paper focuses on scalability and robustness of spectral clustering for
extremely large-scale datasets with limited resources. Two novel algorithms are
proposed, namely, ultra-scalable spectral clustering (U-SPEC) and
ultra-scalable ensemble clustering (U-SENC). In U-SPEC, a hybrid representative
selection strategy and a fast approximation method for K-nearest
representatives are proposed for the construction of a sparse affinity
sub-matrix. By interpreting the sparse sub-matrix as a bipartite graph, the
transfer cut is then utilized to efficiently partition the graph and obtain the
clustering result. In U-SENC, multiple U-SPEC clusterers are further integrated
into an ensemble clustering framework to enhance the robustness of U-SPEC while
maintaining high efficiency. Based on the ensemble generation via multiple
U-SPECs, a new bipartite graph is constructed between objects and base
clusters and then efficiently partitioned to achieve the consensus clustering
result. It is noteworthy that both U-SPEC and U-SENC have nearly linear time
and space complexity, and are capable of robustly and efficiently partitioning
ten-million-level nonlinearly-separable datasets on a PC with 64GB memory.
Experiments on various large-scale datasets have demonstrated the scalability
and robustness of our algorithms. The MATLAB code and experimental data are
available at https://www.researchgate.net/publication/330760669.
Comment: To appear in IEEE Transactions on Knowledge and Data Engineering,
201
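The sparse affinity sub-matrix described in the U-SPEC abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: representatives are chosen by plain random sampling (the paper uses a hybrid strategy), the K nearest representatives are found exactly rather than approximately, and the Gaussian kernel with a mean-distance bandwidth is an assumption. The names `sparse_affinity`, `num_reps`, and `k` are hypothetical.

```python
import math
import random


def sparse_affinity(points, num_reps, k, seed=0):
    """Build an N x p sparse affinity sub-matrix between objects and representatives.

    Returns (reps, rows), where rows[i] maps a representative index to the
    affinity between object i and that representative. Only the k nearest
    representatives of each object get a non-zero entry.
    """
    rng = random.Random(seed)
    # Assumption: plain random selection stands in for the hybrid strategy.
    reps = rng.sample(points, num_reps)

    nearest = []
    for x in points:
        # Squared Euclidean distance from x to every representative (exact search).
        dists = sorted(
            (sum((a - b) ** 2 for a, b in zip(x, r)), j) for j, r in enumerate(reps)
        )
        nearest.append(dists[:k])

    # Assumption: kernel bandwidth = mean distance to the k nearest representatives.
    sigma = sum(math.sqrt(d) for row in nearest for d, _ in row) / (len(points) * k)
    sigma = sigma or 1.0  # guard against an all-zero distance set

    rows = [
        {j: math.exp(-d / (2 * sigma ** 2)) for d, j in row} for row in nearest
    ]
    return reps, rows
```

Each row holds exactly `k` entries, so storage grows linearly with the number of objects. The resulting rows can be read as the edges of a bipartite graph between objects and representatives.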
Utilizing Mixture Regression Models for Clustering Time-Series Energy Consumption of a Plastic Injection Molding Process
Considering the issue of energy consumption reduction in industrial plants, we investigated a clustering method for mining the time-series data related to energy consumption. The industrial case study considered in our work is one of the most energy-intensive processes in the plastics industry: the plastic injection molding process. Concerning the industrial setting, the energy consumption of the injection molding machine was monitored across multiple injection molding cycles. The collected data were then analyzed to establish patterns and trends in the energy consumption of the injection molding process. To this end, we considered mixtures of regression models given their flexibility in modeling heterogeneous time series and clustering time series in an unsupervised machine learning framework. Given the assumption of autocorrelated data and exogenous variables in the mixture model, we implemented an algorithm for model fitting that combined autocorrelated observations with spline and polynomial regressions. Our results demonstrate an accurate grouping of energy-consumption profiles, where each cluster is related to a specific production schedule. The clustering method also provides a unique profile of energy consumption for each cluster, depending on the production schedule and regression approach (i.e., spline and polynomial). According to these profiles, information related to the shape of energy consumption was identified, providing insights into reducing the electrical demand of the plant.
Efficient Multi-View Graph Clustering with Local and Global Structure Preservation
Anchor-based multi-view graph clustering (AMVGC) has received abundant
attention owing to its high efficiency and the capability to capture
complementary structural information across multiple views. Intuitively, a
high-quality anchor graph plays an essential role in the success of AMVGC.
However, the existing AMVGC methods only consider single-structure information,
i.e., local or global structure, which provides insufficient information for
the learning task. To be specific, the over-scattered global structure leads to
learned anchors failing to depict the cluster partition well. In contrast, the
local structure with an improper similarity measure results in potentially
inaccurate anchor assignment, ultimately leading to sub-optimal clustering
performance. To tackle the issue, we propose a novel anchor-based multi-view
graph clustering framework termed Efficient Multi-View Graph Clustering with
Local and Global Structure Preservation (EMVGC-LG). Specifically, a unified
framework with a theoretical guarantee is designed to capture local and global
information. Besides, EMVGC-LG jointly optimizes anchor construction and graph
learning to enhance the clustering quality. In addition, EMVGC-LG inherits the
linear complexity of existing AMVGC methods with respect to the sample number, which
is time-economical and scales well with the data size. Extensive experiments
demonstrate the effectiveness and efficiency of our proposed method.
Comment: arXiv admin note: text overlap with arXiv:2308.1654
SE-shapelets: Semi-supervised Clustering of Time Series Using Representative Shapelets
Shapelets that discriminate time series using local features (subsequences)
are promising for time series clustering. Existing time series clustering
methods may fail to capture representative shapelets because they discover
shapelets from a large pool of uninformative subsequences, and thus result in
low clustering accuracy. This paper proposes a Semi-supervised Clustering of
Time Series Using Representative Shapelets (SE-Shapelets) method, which
utilizes a small number of labeled and propagated pseudo-labeled time series to
help discover representative shapelets, thereby improving the clustering
accuracy. In SE-Shapelets, we propose two techniques to discover representative
shapelets for the effective clustering of time series. 1) A salient
subsequence chain that can extract salient subsequences (as candidate
shapelets) of a labeled/pseudo-labeled time series, which helps remove massive
uninformative subsequences from the pool. 2) A linear discriminant
selection algorithm to identify shapelets that can capture
representative local features of time series in different classes, for
convenient clustering. Experiments on UCR time series datasets demonstrate that
SE-shapelets discovers representative shapelets and achieves higher clustering
accuracy than counterpart semi-supervised time series clustering methods.
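To make the role of shapelets as local features concrete, the sketch below uses the standard definition of the distance between a shapelet and a time series: the smallest Euclidean distance between the shapelet and any subsequence of the same length. It does not reproduce the paper's discovery or selection steps; the function names are hypothetical.

```python
import math


def shapelet_distance(series, shapelet):
    """Smallest Euclidean distance between the shapelet and any window of the series."""
    m = len(shapelet)
    if m == 0 or m > len(series):
        raise ValueError("shapelet must be non-empty and no longer than the series")
    best = math.inf
    for start in range(len(series) - m + 1):
        # Squared distance between the shapelet and the window starting at `start`.
        d = sum((series[start + i] - shapelet[i]) ** 2 for i in range(m))
        best = min(best, d)
    return math.sqrt(best)


def shapelet_transform(dataset, shapelets):
    """Map each series to a feature vector of distances to the given shapelets."""
    return [[shapelet_distance(s, sh) for sh in shapelets] for s in dataset]
```

Once each series is mapped to such a distance vector, any standard clustering method can be applied to the transformed features.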
Image Clustering with Contrastive Learning and Multi-scale Graph Convolutional Networks
Deep clustering has recently attracted significant attention. Despite the
remarkable progress, most of the previous deep clustering works still suffer
from two limitations. First, many of them focus on some distribution-based
clustering loss, lacking the ability to exploit sample-wise (or
augmentation-wise) relationships via contrastive learning. Second, they often
neglect the indirect sample-wise structure information, overlooking the rich
possibilities of multi-scale neighborhood structure learning. In view of this,
this paper presents a new deep clustering approach termed Image clustering with
contrastive learning and multi-scale Graph Convolutional Networks (IcicleGCN),
which bridges the gap between convolutional neural network (CNN) and graph
convolutional network (GCN) as well as the gap between contrastive learning and
multi-scale neighborhood structure learning for the image clustering task. The
proposed IcicleGCN framework consists of four main modules, namely, the
CNN-based backbone, the Instance Similarity Module (ISM), the Joint Cluster
Structure Learning and Instance reconstruction Module (JC-SLIM), and the
Multi-scale GCN module (M-GCN). Specifically, with two random augmentations
performed on each image, the backbone network with two weight-sharing views is
utilized to learn the representations for the augmented samples, which are then
fed to ISM and JC-SLIM for instance-level and cluster-level contrastive
learning, respectively. Further, to enforce multi-scale neighborhood structure
learning, two streams of GCNs and an auto-encoder are simultaneously trained
via (i) the layer-wise interaction with representation fusion and (ii) the
joint self-adaptive learning that ensures their last-layer output distributions
are consistent. Experiments on multiple image datasets demonstrate the
superior clustering performance of IcicleGCN over the state-of-the-art.
DeepCluE: Enhanced Image Clustering via Multi-layer Ensembles in Deep Neural Networks
Deep clustering has recently emerged as a promising technique for complex
data clustering. Despite the considerable progress, previous deep clustering
works mostly build or learn the final clustering by only utilizing a single
layer of representation, e.g., by performing the K-means clustering on the last
fully-connected layer or by associating some clustering loss to a specific
layer, which neglects the possibilities of jointly leveraging multi-layer
representations for enhancing the deep clustering performance. In view of this,
this paper presents a Deep Clustering via Ensembles (DeepCluE) approach, which
bridges the gap between deep clustering and ensemble clustering by harnessing
the power of multiple layers in deep neural networks. In particular, we utilize
a weight-sharing convolutional neural network as the backbone, which is trained
with both the instance-level contrastive learning (via an instance projector)
and the cluster-level contrastive learning (via a cluster projector) in an
unsupervised manner. Thereafter, multiple layers of feature representations are
extracted from the trained network, upon which the ensemble clustering process
is further conducted. Specifically, a set of diversified base clusterings are
generated from the multi-layer representations via a highly efficient
clusterer. Then the reliability of clusters in multiple base clusterings is
automatically estimated by exploiting an entropy-based criterion, based on
which the set of base clusterings are re-formulated into a weighted-cluster
bipartite graph. By partitioning this bipartite graph via transfer cut, the
final consensus clustering can be obtained. Experimental results on six image
datasets confirm the advantages of DeepCluE over the state-of-the-art deep
clustering approaches.
Comment: To appear in IEEE Transactions on Emerging Topics in Computational
Intelligence
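The abstract above does not spell out its entropy-based criterion. As an illustration only, the sketch below uses one common formulation from the ensemble clustering literature, which is an assumption here: a cluster is considered reliable when the base clusterings rarely split its members, and its total entropy is mapped to a weight in (0, 1] by an exponential. The names `cluster_entropy`, `reliability`, and the parameter `theta` are hypothetical.

```python
import math
from collections import Counter


def cluster_entropy(cluster, base_clusterings):
    """Sum, over all base clusterings, of the entropy of how `cluster` is split.

    cluster: iterable of object indices.
    base_clusterings: list of label lists, one label per object.
    """
    members = list(cluster)
    total = 0.0
    for labels in base_clusterings:
        # How many members of the cluster fall into each base cluster.
        counts = Counter(labels[i] for i in members)
        for c in counts.values():
            p = c / len(members)
            total -= p * math.log2(p)
    return total


def reliability(cluster, base_clusterings, theta=0.5):
    """Assumed weighting: exp(-entropy / (theta * number of base clusterings))."""
    h = cluster_entropy(cluster, base_clusterings)
    return math.exp(-h / (theta * len(base_clusterings)))
```

A cluster that every base clustering keeps intact has zero entropy and weight 1; the more often it is split, the lower its weight in the weighted bipartite graph.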
MeanCut: A Greedy-Optimized Graph Clustering via Path-based Similarity and Degree Descent Criterion
As the most typical graph clustering method, spectral clustering is popular
and attractive due to the remarkable performance, easy implementation, and
strong adaptability. Classical spectral clustering measures the edge weights of
the graph using a pairwise Euclidean-based metric, and solves the optimal graph
partition by relaxing the constraints of the indicator matrix and performing
Laplacian decomposition. However, Euclidean-based similarity might cause skewed
graph cuts when handling non-spherical data distributions, and the relaxation
strategy introduces information loss. Meanwhile, spectral clustering requires
specifying the number of clusters, which is hard to determine without enough
prior knowledge. In this work, we leverage the path-based similarity to enhance
intra-cluster associations, and propose MeanCut as the objective function and
greedily optimize it in degree descending order for a nondestructive graph
partition. This algorithm enables the identification of arbitrarily shaped
clusters and is robust to noise. To reduce the computational complexity of
similarity calculation, we transform optimal path search into generating the
maximum spanning tree (MST), and develop a fast MST (FastMST) algorithm to
further improve its time-efficiency. Moreover, we define a density gradient
factor (DGF) for separating the weakly connected clusters. The validity of our
algorithm is demonstrated by testing on real-world benchmarks and a face
recognition application. The source code of MeanCut is available at
https://github.com/ZPGuiGroupWhu/MeanCut-Clustering.
Comment: 17 pages, 8 figures, 6 tables
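The link between path-based similarity and the maximum spanning tree mentioned in the MeanCut abstract can be sketched as follows. The path-based similarity of two nodes is taken here to be the largest, over all connecting paths, of the weakest edge on the path, which equals the weakest edge on their path in the maximum spanning tree. The sketch uses a plain Kruskal-style merge, not the paper's FastMST, and the function name is hypothetical.

```python
def path_based_similarity(sim):
    """Return the matrix of path-based (max-min) similarities.

    sim: symmetric n x n list of lists of pairwise similarities.
    Edges are processed from strongest to weakest. When an edge joins two
    components, it is the weakest edge on the spanning-tree path between
    every pair of nodes drawn from the two sides.
    """
    n = len(sim)
    edges = sorted(
        ((sim[i][j], i, j) for i in range(n) for j in range(i + 1, n)),
        reverse=True,
    )
    comp = list(range(n))                 # component label of each node
    members = {i: [i] for i in range(n)}  # nodes in each component
    out = [[sim[i][j] if i == j else 0.0 for j in range(n)] for i in range(n)]

    for w, i, j in edges:
        a, b = comp[i], comp[j]
        if a == b:
            continue  # edge lies inside a component, so it is not a tree edge
        for u in members[a]:
            for v in members[b]:
                out[u][v] = out[v][u] = w
        if len(members[a]) < len(members[b]):
            a, b = b, a  # merge the smaller component into the larger one
        for v in members[b]:
            comp[v] = a
        members[a].extend(members.pop(b))
    return out
```

On a chain of strongly linked points, two endpoints that are far apart in Euclidean terms still receive a high path-based similarity, which is what helps with non-spherical clusters.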
Fast Clustering Using a Grid-Based Underlying Density Function Approximation
Clustering is an unsupervised machine learning task that seeks to partition a set of data into smaller groupings, referred to as “clusters”, where items within the same cluster are somehow alike, while differing from those in other clusters. There are many different algorithms for clustering, but many of them are overly complex and scale poorly with larger data sets. In this paper, a new algorithm for clustering is proposed to solve some of these issues. Density-based clustering algorithms use a concept called the “underlying density function”, which is a conceptual higher-dimensional function that describes the possible results from the continuous data set that our input data is just a discrete sample of. The algorithm proposed in this paper seeks to use this concept by creating a piecewise approximation of the underlying density function, and then merging points towards local density maxima from this higher-dimensional space. First, the data space is divided into a grid-based structure and the density of each grid is calculated. Second, each of these “grid-squares” determines the densest space in its local area. Finally, the grid squares are merged together in the direction of their local density maximum, ultimately merging with one of the density maxima that form the root of a cluster. The experimental results show significant time improvements over standard algorithms such as DBSCAN with no accuracy penalty. Furthermore, the algorithm is also suitable for use with parallel and distributed systems, as an implementation with Apache Spark showed proper parallel scaling with low data set sizes required to overtake the serial implementation.
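The three steps in the abstract above (grid the space, find the densest cell in each local area, merge toward local maxima) can be sketched as follows. This is a simplified serial illustration under stated assumptions, not the paper's algorithm: density is the raw point count per cell, the "local area" is the cell plus its immediate neighbours, and ties are broken by cell coordinates. The names `grid_cluster` and `cell_size` are hypothetical.

```python
from collections import Counter
from itertools import product


def grid_cluster(points, cell_size):
    """Assign a cluster label to each point by climbing toward dense grid cells."""
    # Step 1: map every point to a grid cell and count points per cell.
    cells = [tuple(int(c // cell_size) for c in p) for p in points]
    density = Counter(cells)

    # Step 2: each occupied cell points to the densest cell in its neighbourhood.
    def densest_neighbor(cell):
        best = cell
        for offset in product((-1, 0, 1), repeat=len(cell)):
            nb = tuple(c + o for c, o in zip(cell, offset))
            # Comparing (density, cell) gives a strict order, so no cycles form.
            if nb in density and (density[nb], nb) > (density[best], best):
                best = nb
        return best

    parent = {c: densest_neighbor(c) for c in density}

    # Step 3: follow the pointers until a local density maximum (a root) is reached.
    def root(cell):
        while parent[cell] != cell:
            cell = parent[cell]
        return cell

    ids = {}
    labels = []
    for c in cells:
        labels.append(ids.setdefault(root(c), len(ids)))
    return labels
```

Each cell's pointer depends only on its neighbours, so step 2 is independent across cells, which is consistent with the abstract's remark that the method suits parallel and distributed execution.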
Joint Multi-view Unsupervised Feature Selection and Graph Learning
Despite the recent progress, the existing multi-view unsupervised feature
selection methods mostly suffer from two limitations. First, they generally
utilize either cluster structure or similarity structure to guide the feature
selection, neglecting the possibility of a joint formulation with mutual
benefits. Second, they often learn the similarity structure by either global
structure learning or local structure learning, lacking the capability of graph
learning with both global and local structural awareness. In light of this,
this paper presents a joint multi-view unsupervised feature selection and graph
learning (JMVFG) approach. Particularly, we formulate the multi-view feature
selection with orthogonal decomposition, where each target matrix is decomposed
into a view-specific basis matrix and a view-consistent cluster indicator.
Cross-space locality preservation is incorporated to bridge the cluster
structure learning in the projected space and the similarity learning (i.e.,
graph learning) in the original space. Further, a unified objective function is
presented to enable the simultaneous learning of the cluster structure, the
global and local similarity structures, and the multi-view consistency and
inconsistency, upon which an alternating optimization algorithm is developed
with theoretically proved convergence. Extensive experiments demonstrate the
superiority of our approach for both multi-view feature selection and graph
learning tasks.