An Exact Algorithm for Semi-supervised Minimum Sum-of-Squares Clustering
The minimum sum-of-squares clustering (MSSC), or k-means type clustering, is
traditionally considered an unsupervised learning task. In recent years, the
use of background knowledge to improve the cluster quality and promote
interpretability of the clustering process has become a hot research topic at
the intersection of mathematical optimization and machine learning research.
The problem of taking advantage of background information in data clustering is
called semi-supervised or constrained clustering. In this paper, we present a
branch-and-cut algorithm for semi-supervised MSSC, where background knowledge
is incorporated as pairwise must-link and cannot-link constraints. For the
lower bound procedure, we solve the semidefinite programming relaxation of the
MSSC discrete optimization model, and we use a cutting-plane procedure for
strengthening the bound. For the upper bound, we instead use integer
programming tools to adapt the k-means algorithm to the constrained case. For
the first time, the proposed global optimization
algorithm efficiently manages to solve real-world instances up to 800 data
points with different combinations of must-link and cannot-link constraints and
with a generic number of features. This problem size is about four times larger
than that of the instances solved by state-of-the-art exact algorithms.
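Before an exact method like the one above runs, the must-link constraints are commonly consolidated by taking their transitive closure, which merges mutually linked points into super-points and shrinks the instance. A minimal union-find sketch of that standard preprocessing step (function and variable names are illustrative, not taken from the paper):

```python
def find(parent, i):
    # Path-compressing find: walk to the root of i's group.
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i

def merge_must_links(n, must_links):
    """Group point indices 0..n-1 into the super-points implied by
    the transitive closure of the must-link pairs."""
    parent = list(range(n))
    for i, j in must_links:
        ri, rj = find(parent, i), find(parent, j)
        if ri != rj:
            parent[ri] = rj  # union the two groups
    groups = {}
    for i in range(n):
        groups.setdefault(find(parent, i), []).append(i)
    return list(groups.values())
```

Cannot-link constraints, by contrast, cannot be eliminated this way and stay in the model as explicit separation constraints.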
Constrained k-Means Clustering Validation Study
Machine Learning (ML) is a growing topic within Computer Science with applications in many fields. One open problem in ML is data separation, or data clustering. Our project is a validation study of "Constrained K-means Clustering with Background Knowledge" by Wagstaff et al. Our data validates the finding by Wagstaff et al. that a modified k-means clustering approach can outperform more general unsupervised learning algorithms when some domain information about the problem is available. Our data suggests that k-means clustering augmented with domain information can be a time-efficient means for segmenting data sets. Our validation study focused on six classic data sets used by Wagstaff et al. and does not consider the GPS data of the original study. We have published our code in a public SWOSU GitHub repository to enable other researchers to use our code as a starting point. Validation studies such as this provide great learning opportunities for students interested in working with Machine Learning, Artificial Intelligence, and other related applications. This research was funded in part by the Dr. Snowden Memorial Scholarship with the NASA Oklahoma Space Grant Consortium. This material is based upon work supported by the National Aeronautics and Space Administration issued through the Oklahoma Space Grant Consortium.
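The modified k-means algorithm validated here is Wagstaff et al.'s COP-KMEANS, whose assignment step places each point in the nearest cluster that breaks no constraint and declares failure when no feasible cluster exists. A minimal sketch of that step, with illustrative names and plain Euclidean distance:

```python
def violates(point, cluster_id, assignment, must_link, cannot_link):
    """COP-KMEANS violation check: would placing `point` in
    `cluster_id` contradict any assignment already made?"""
    for a, b in must_link:
        other = b if a == point else (a if b == point else None)
        if other is not None and other in assignment and assignment[other] != cluster_id:
            return True  # must-link partner sits in a different cluster
    for a, b in cannot_link:
        other = b if a == point else (a if b == point else None)
        if other is not None and assignment.get(other) == cluster_id:
            return True  # cannot-link partner sits in this cluster
    return False

def assign(points, centroids, must_link, cannot_link):
    """Greedy COP-KMEANS assignment: nearest feasible centroid for
    each point, or None when some point has no feasible centroid."""
    assignment = {}
    for p, x in enumerate(points):
        order = sorted(range(len(centroids)),
                       key=lambda c: sum((xi - ci) ** 2
                                         for xi, ci in zip(x, centroids[c])))
        for c in order:
            if not violates(p, c, assignment, must_link, cannot_link):
                assignment[p] = c
                break
        else:
            return None  # constraint set is infeasible for these centroids
    return assignment
```

In the full algorithm this assignment step alternates with the usual centroid-update step until convergence; the sketch shows only one pass.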
Community Structure Detection in Complex Networks with Partial Background Information
Constrained clustering has been well studied in the unsupervised learning
community. However, how to encode constraints into community structure
detection within complex networks remains a challenging problem. In this paper, we
propose a semi-supervised learning framework for community structure detection.
This framework implicitly encodes the must-link and cannot-link constraints by
modifying the adjacency matrix of the network, which can also be regarded as
de-noising the consensus matrix of community structures. Our proposed method
gives consideration to both the topology and the functions (background
information) of the complex network, which enhances the interpretability of the
results. The comparisons performed on both the synthetic benchmarks and the
real-world networks show that the proposed framework can significantly improve
the community detection performance with few constraints, which makes it an
attractive methodology in the analysis of complex networks.
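The constraint-encoding idea described above, editing the adjacency matrix rather than the detection algorithm itself, can be illustrated with a simplified rule: must-link pairs gain an edge and cannot-link pairs lose theirs. This is a sketch of the general idea, not the paper's exact update, which the abstract does not fully specify; the weight parameter `w` is an assumption:

```python
import numpy as np

def encode_constraints(adj, must_link, cannot_link, w=1.0):
    """Encode pairwise constraints into a symmetric adjacency matrix:
    must-link pairs receive an edge of weight w, cannot-link pairs
    have their edge removed. Any community detection method can then
    run unchanged on the modified matrix."""
    A = adj.astype(float).copy()
    for i, j in must_link:
        A[i, j] = A[j, i] = w    # force a connection
    for i, j in cannot_link:
        A[i, j] = A[j, i] = 0.0  # sever the connection
    return A
```

The appeal of this design is that the semi-supervision lives entirely in the data, so existing unsupervised detectors need no modification.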
Multi-view constrained clustering with an incomplete mapping between views
Multi-view learning algorithms typically assume a complete bipartite mapping
between the different views in order to exchange information during the
learning process. However, many applications provide only a partial mapping
between the views, creating a challenge for current methods. To address this
problem, we propose a multi-view algorithm based on constrained clustering that
can operate with an incomplete mapping. Given a set of pairwise constraints in
each view, our approach propagates these constraints using a local similarity
measure to those instances that can be mapped to the other views, allowing the
propagated constraints to be transferred across views via the partial mapping.
It uses co-EM to iteratively estimate the propagation within each view based on
the current clustering model, transfer the constraints across views, and then
update the clustering model. By alternating the learning process between views,
this approach produces a unified clustering model that is consistent with all
views. We show that this approach significantly improves clustering performance
over several other methods for transferring constraints and allows multi-view
clustering to be reliably applied when given a limited mapping between the
views. Our evaluation reveals that the propagated constraints have high
precision with respect to the true clusters in the data, explaining their
benefit to clustering performance in both single- and multi-view learning
scenarios.
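The propagation step described above, spreading a pairwise constraint to instances deemed similar by a local measure, can be illustrated with a simplified rule: a constraint on (a, b) is copied to every pair whose endpoints lie within a fixed radius of a and b respectively. The Euclidean-ball criterion here is an assumption for illustration, not the paper's similarity measure:

```python
import math

def propagate(constraints, features, radius=1.0):
    """Propagate each pairwise constraint (a, b) to nearby pairs
    (a2, b2) with a2 within `radius` of a and b2 within `radius`
    of b. Returns the enlarged constraint set as (min, max) pairs."""
    def close(i, j):
        return math.dist(features[i], features[j]) <= radius

    out = set(constraints)
    n = len(features)
    for a, b in constraints:
        for a2 in range(n):
            for b2 in range(n):
                if a2 != b2 and close(a, a2) and close(b, b2):
                    out.add((min(a2, b2), max(a2, b2)))
    return out
```

In the multi-view setting, the propagated constraints on mapped instances are then transferred across the partial mapping, which is what lets supervision reach views where the original constraints have no counterpart.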
Semi-supervised cross-entropy clustering with information bottleneck constraint
In this paper, we propose a semi-supervised clustering method, CEC-IB, that
models data with a set of Gaussian distributions and that retrieves clusters
based on a partial labeling provided by the user (partition-level side
information). By combining the ideas from cross-entropy clustering (CEC) with
those from the information bottleneck method (IB), our method trades between
three conflicting goals: the accuracy with which the data set is modeled, the
simplicity of the model, and the consistency of the clustering with side
information. Experiments demonstrate that CEC-IB has a performance comparable
to Gaussian mixture models (GMM) in a classical semi-supervised scenario, but
is faster, more robust to noisy labels, automatically determines the optimal
number of clusters, and performs well when not all classes are present in the
side information. Moreover, in contrast to other semi-supervised models, it can
be successfully applied in discovering natural subgroups if the partition-level
side information is derived from the top levels of a hierarchical clustering.
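The three-way trade-off the abstract describes can be written schematically as one weighted cost, where lower is better on each term. The additive form and the weights below are illustrative assumptions, not the paper's exact objective:

```python
def cecib_cost(model_fit, n_clusters, label_inconsistency,
               alpha=1.0, beta=1.0):
    """Schematic CEC-IB-style trade-off: data fit (cross-entropy of
    the Gaussian model), model simplicity (penalty on cluster count),
    and consistency with the partition-level side information."""
    return model_fit + alpha * n_clusters + beta * label_inconsistency
```

Because the cluster-count term is penalized alongside fit and consistency, minimizing such a cost over candidate models is what lets the method determine the number of clusters automatically.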