Learning with Partial Supervision for Clustering and Classification
In the field of machine learning, clustering and classification are two fundamental tasks. Traditionally, clustering is an unsupervised method, where no supervision about the data is available for learning; classification is a supervised task, where fully-labeled data are collected for training a classifier. In some scenarios, however, we may not have the full labels but only partial supervision about the data, such as instance similarities or incomplete label assignments. In such cases, traditional clustering and classification methods do not directly apply. To address such problems, this thesis focuses on learning from partial supervision for clustering and classification tasks. For clustering with partial supervision, we investigate three problems: a) constrained clustering in multi-instance multi-label learning, where the goal is to group instances into clusters that respect the background knowledge given by the bag-level labels; b) clustering with constraints, where the partial supervision is expressed as "pairwise constraints" or "relative constraints", which describe similarities among instance pairs and triplets, respectively; c) active learning of pairwise constraints for clustering, where the goal is to improve the clustering with minimum human effort by iteratively querying an oracle about the most informative pairs. For classification with partial supervision, we address the problem of multi-label learning where data is associated with a latent label hierarchy and incomplete label assignments, and the goal is to simultaneously discover the latent hierarchy and learn a multi-label classifier that is consistent with the hierarchy.
Keywords: Classification, Partial Supervision, Active Learning, Clustering
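The pairwise and relative constraints mentioned in this abstract can be made concrete with a small sketch. The code below is only an illustration, not the thesis's method: the function names, the toy labels, and the rule used for triplets (a triplet `(a, b, c)`, read as "a is more similar to b than to c", counts as violated when `a` shares a cluster with `c` but not with `b`) are all assumptions made for the example.

```python
from typing import Dict, Iterable, Tuple


def pairwise_violations(
    labels: Dict[str, int],
    must_link: Iterable[Tuple[str, str]],
    cannot_link: Iterable[Tuple[str, str]],
) -> int:
    """Count pairwise constraints that a hard clustering breaks."""
    # A must-link pair is violated when its two instances land in different clusters.
    ml = sum(1 for a, b in must_link if labels[a] != labels[b])
    # A cannot-link pair is violated when its two instances share a cluster.
    cl = sum(1 for a, b in cannot_link if labels[a] == labels[b])
    return ml + cl


def relative_violations(
    labels: Dict[str, int],
    triplets: Iterable[Tuple[str, str, str]],
) -> int:
    """Count violated triplets (a, b, c): 'a is more similar to b than to c'.

    Assumed rule for this sketch: a triplet is violated when a is grouped
    with c while being separated from b.
    """
    return sum(
        1
        for a, b, c in triplets
        if labels[a] == labels[c] and labels[a] != labels[b]
    )


# Hypothetical toy clustering of four instances into two clusters.
labels = {"x1": 0, "x2": 0, "x3": 1, "x4": 1}
must_link = [("x1", "x2"), ("x2", "x3")]
cannot_link = [("x1", "x4"), ("x3", "x4")]
triplets = [("x1", "x2", "x3"), ("x3", "x1", "x4")]

print(pairwise_violations(labels, must_link, cannot_link))
print(relative_violations(labels, triplets))
```

A count like this could serve as a penalty term or as an evaluation measure; how constraints actually enter the objective depends on the specific method.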
Data Clustering and Partial Supervision with Some Parallel Developments
by Sameh A. Salem
Clustering is an important and irreplaceable step towards the search for structures in the
data. Many different clustering algorithms have been proposed. Yet, the sources of variability
in most clustering algorithms affect the reliability of their results. Moreover, the
majority tend to be based on the knowledge of the number of clusters as one of the input
parameters. Unfortunately, there are many scenarios, where this knowledge may not be
available. In addition, clustering algorithms are computationally intensive, which makes
scaling up to large datasets a major challenge. This thesis gives possible
solutions for such problems.
First, new measures - called clustering performance measures (CPMs) - for assessing
the reliability of a clustering algorithm are introduced. These CPMs can be used to evaluate:
1) clustering algorithms that have a structure bias towards certain types of data distribution
as well as those that have no such biases, 2) clustering algorithms that have initialisation
dependency as well as the clustering algorithms that have a unique solution for a given set
of parameter values with no initialisation dependency.
Then, a novel clustering algorithm, which is a RAdius based Clustering ALgorithm
(RACAL), is proposed. RACAL uses a distance based principle to map the distributions of
the data assuming that clusters are determined by a distance parameter, without having to
specify the number of clusters. Furthermore, RACAL is enhanced by a validity index to
choose the best clustering result, i.e. the result with compact clusters and wide cluster separations,
for a given input parameter. Comparisons with other clustering algorithms indicate
the applicability and reliability of the proposed clustering algorithm. Additionally, an adaptive
partial supervision strategy is proposed for use in conjunction with RACAL to make
it act as a classifier. Results from RACAL with partial supervision, RACAL-PS, indicate
its robustness in classification. Additionally, a parallel version of RACAL (P-RACAL) is
proposed. The parallel evaluations of P-RACAL indicate that P-RACAL is scalable in terms
of speedup and scaleup, which gives the ability to handle large datasets of high dimensions
in a reasonable time.
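To illustrate the general idea of letting a distance parameter, rather than a preset number of clusters, determine the grouping, here is a minimal leader-style sketch. It is not RACAL itself, whose details are not given in this abstract; the function name `radius_cluster`, the first-come choice of cluster representatives, and the toy points are assumptions made for illustration.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def radius_cluster(points: Sequence[Point], radius: float) -> Tuple[List[int], List[Point]]:
    """Assign each point to the nearest representative within `radius`.

    A point that is farther than `radius` from every existing representative
    starts a new cluster, so the number of clusters is an outcome of the
    radius rather than an input.
    """
    centers: List[Point] = []  # the first point of each cluster acts as its representative
    labels: List[int] = []
    for p in points:
        best, best_d = None, radius
        for k, c in enumerate(centers):
            d = math.dist(p, c)
            if d <= best_d:
                best, best_d = k, d
        if best is None:
            centers.append(p)
            best = len(centers) - 1
        labels.append(best)
    return labels, centers


# Hypothetical 2-D data: two tight groups and one isolated point.
points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9), (10.0, 0.0)]
labels, centers = radius_cluster(points, radius=1.0)
print(labels, len(centers))
```

A single pass like this depends on the order of the points; a validity index of the kind the abstract describes could be used to compare the results obtained for different radius values.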
Next, a novel clustering algorithm, which achieves clustering without any control of
cluster sizes, is introduced. This algorithm, which is called the Nearest Neighbour Clustering
Algorithm (NNCA), uses the same concept as the K-Nearest Neighbour (KNN) classifier
with the advantage that the algorithm needs no training set and it is completely unsupervised.
Additionally, NNCA is augmented with a partial supervision strategy, NNCA-PS, to
act as a classifier. Comparisons with other methods indicate the robustness of the proposed
method in classification. Additionally, experiments in a parallel environment indicate the
suitability and scalability of the parallel NNCA, P-NNCA, in handling large datasets.
Further investigations on more challenging data are carried out. In this context, microarray
data is considered. In such data, the number of clusters is not clearly defined.
This points directly towards clustering algorithms that do not require knowledge
of the number of clusters. Therefore, the efficacy of one of these algorithms is examined.
Finally, a novel integrated clustering performance measure (ICPM) is proposed to be used
as a guideline for choosing the proper clustering algorithm that has the ability to extract
useful biological information in a particular dataset.
Supplied by The British Library - 'The world's knowledge'
Orderly Subspace Clustering
Semi-supervised representation-based subspace clustering aims
to partition data into their underlying subspaces by finding
effective data representations under partial supervision. Essentially, an effective and accurate representation should be
able to uncover and preserve the true data structure. Meanwhile, reliable and easy-to-obtain supervision is desirable
for practical learning. To meet these two objectives, in this
paper we make the first attempt to use orderly relationships, such as data point a being closer to b than to c, as
a novel form of supervision. We propose an orderly subspace clustering (OSC) approach with a novel regularization term. OSC forces the learned representations to simultaneously capture
the intrinsic subspace structure and reveal an orderly structure
that is faithful to the true data relationships. Experimental results
on several benchmarks demonstrate that, in addition to
clustering more accurately than state-of-the-art methods, OSC interprets orderly data structure, which is beyond what current approaches can offer.
Adaptive constrained clustering with application to dynamic image database categorization and visualization.
The advent of larger storage spaces, affordable digital capturing devices, and an ever growing online community dedicated to sharing images has created a great need for efficient analysis methods. In fact, analyzing images for the purpose of automatic categorization and retrieval is quickly becoming an overwhelming task even for the casual user. Initially, systems designed for these applications relied on contextual information associated with images. However, it was realized that this approach does not scale to very large data sets and can be subjective. Researchers then proposed methods relying on the content of the images. This approach has also proved to be limited due to the semantic gap between the low-level representation of the image and the high-level user perception. In this dissertation, we introduce a novel clustering technique that is designed to combine multiple forms of information in order to overcome the disadvantages observed while using a single information domain. Our proposed approach, called Adaptive Constrained Clustering (ACC), is a robust, dynamic, and semi-supervised algorithm. It is based on minimizing a single objective function incorporating the abilities to: (i) use multiple feature subsets while learning cluster independent feature relevance weights; (ii) search for the optimal number of clusters; and (iii) incorporate partial supervision in the form of pairwise constraints. The content of the images is used to extract the features used in the clustering process. The context information is used in constructing a set of appropriate constraints. These constraints are used as partial supervision information to guide the clustering process. The ACC algorithm is dynamic in the sense that the number of categories is allowed to expand and contract depending on the distribution of the data and the available set of constraints.
We show that the proposed ACC algorithm is able to partition a given data set into meaningful clusters using an adaptive, soft constraint satisfaction methodology for the purpose of automatically categorizing and summarizing an image database. We show that the ACC algorithm has the ability to incorporate various types of contextual information. This contextual information includes: spatial information provided by geo-referenced images that include GPS coordinates pinpointing their location, temporal information provided by each image's time stamp indicating the capture time, and textual information provided by a set of keywords describing the semantics of the associated images.
Fast Gaussian Pairwise Constrained Spectral Clustering
We consider the problem of spectral clustering with partial supervision in the form of must-link and cannot-link constraints. Such pairwise constraints are common in problems like coreference resolution in natural language processing. The approach developed in this paper is to learn a new representation space for the data together with a distance in this new space. The representation space is obtained through a constraint-driven linear transformation of a spectral embedding of the data. Constraints are expressed with a Gaussian function that locally reweights the similarities in the projected space. A global, non-convex optimization objective is then derived and the model is learned via gradient descent techniques. Our algorithm is evaluated on standard datasets and compared with state-of-the-art algorithms, like [14,18,31]. Results on these datasets, as well as on the CoNLL-2012 coreference resolution shared task dataset, show that our algorithm significantly outperforms related approaches and is also much more scalable.
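As a rough illustration of how a Gaussian function can score must-link and cannot-link constraints in an embedded space, consider the sketch below. It is not the paper's objective, which the abstract does not spell out: the loss form (one minus similarity for must-link pairs, similarity for cannot-link pairs), the names `gaussian_similarity` and `constraint_loss`, the bandwidth `sigma`, and the toy embedding are all assumptions.

```python
import math
from typing import Iterable, Sequence, Tuple

Point = Tuple[float, ...]


def gaussian_similarity(u: Point, v: Point, sigma: float = 1.0) -> float:
    """Gaussian similarity exp(-||u - v||^2 / (2 * sigma^2)), in (0, 1]."""
    return math.exp(-math.dist(u, v) ** 2 / (2.0 * sigma ** 2))


def constraint_loss(
    embedding: Sequence[Point],
    must_link: Iterable[Tuple[int, int]],
    cannot_link: Iterable[Tuple[int, int]],
    sigma: float = 1.0,
) -> float:
    """Assumed penalty: must-link pairs should be similar, cannot-link pairs dissimilar."""
    loss = 0.0
    for i, j in must_link:
        loss += 1.0 - gaussian_similarity(embedding[i], embedding[j], sigma)
    for i, j in cannot_link:
        loss += gaussian_similarity(embedding[i], embedding[j], sigma)
    return loss


# Hypothetical embedding: points 0 and 1 coincide, point 2 lies far away.
embedding = [(0.0, 0.0), (0.0, 0.0), (3.0, 4.0)]
loss = constraint_loss(embedding, must_link=[(0, 1)], cannot_link=[(0, 2)])
bad = constraint_loss(embedding, must_link=[(0, 2)], cannot_link=[(0, 1)])
print(loss, bad)
```

In a learned setting, the embedding would be the output of a trainable transformation, and a loss of this kind would be minimized by gradient descent; here the embedding is fixed to keep the example self-contained.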