Empirical Comparative Analysis of 1-of-K Coding and K-Prototypes in Categorical Clustering
Clustering is a fundamental machine learning application that partitions data into homogeneous groups. K-means and its variants are the most widely used class of clustering algorithms today. However, the original k-means algorithm can only be applied to numeric data. Categorical data must first be converted into numeric data through 1-of-K coding, which itself causes many problems. K-prototypes, another clustering algorithm that originates from k-means, can handle categorical data by adopting a different notion of distance. In this paper, we systematically compare these two methods through an experimental analysis. Our analysis shows that K-prototypes is better suited to large-scale datasets, while the performance of k-means with 1-of-K coding is more stable. We believe these are useful heuristics for clustering methods working with highly categorical data.
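To make the contrast concrete, the following is a minimal sketch of the two ideas named in the abstract: 1-of-K (one-hot) coding of a categorical column, and a k-prototypes style dissimilarity that adds squared Euclidean distance on numeric attributes to a weighted count of categorical mismatches. The function names, the toy data, and the weight `gamma` are illustrative assumptions, not taken from the paper.

```python
def one_hot(column):
    """1-of-K coding: each category becomes one binary indicator column."""
    categories = sorted(set(column))
    index = {c: i for i, c in enumerate(categories)}
    encoded = [
        [1.0 if index[value] == j else 0.0 for j in range(len(categories))]
        for value in column
    ]
    return encoded, categories


def mixed_dissimilarity(x_num, x_cat, proto_num, proto_cat, gamma=1.0):
    """k-prototypes style dissimilarity between a point and a prototype.

    Numeric part: squared Euclidean distance.
    Categorical part: number of mismatching attributes, weighted by gamma
    (gamma is a hypothetical value chosen by hand here).
    """
    numeric_part = sum((a - b) ** 2 for a, b in zip(x_num, proto_num))
    categorical_part = sum(1 for a, b in zip(x_cat, proto_cat) if a != b)
    return numeric_part + gamma * categorical_part


# Toy data (hypothetical).
colors = ["red", "green", "red", "blue"]
encoded, categories = one_hot(colors)

distance = mixed_dissimilarity(
    x_num=[1.0, 2.0], x_cat=["red", "small"],
    proto_num=[0.0, 2.0], proto_cat=["blue", "small"],
    gamma=0.5,
)
```

With 1-of-K coding the number of columns grows with the number of distinct categories, whereas the mismatch count works on the categorical values directly.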
GOGGLES: Automatic Image Labeling with Affinity Coding
Generating large labeled training data is becoming the biggest bottleneck in
building and deploying supervised machine learning models. Recently, the data
programming paradigm has been proposed to reduce the human cost in labeling
training data. However, data programming relies on designing labeling functions
which still requires significant domain expertise. Also, it is prohibitively
difficult to write labeling functions for image datasets as it is hard to
express domain knowledge using raw features for images (pixels).
We propose affinity coding, a new domain-agnostic paradigm for automated
training data labeling. The core premise of affinity coding is that the
affinity scores of instance pairs belonging to the same class on average should
be higher than those of pairs belonging to different classes, according to some
affinity functions. We build the GOGGLES system that implements affinity coding
for labeling image datasets by designing a novel set of reusable affinity
functions for images, and propose a novel hierarchical generative model for
class inference using a small development set.
We compare GOGGLES with existing data programming systems on 5 image labeling
tasks from diverse domains. GOGGLES achieves labeling accuracies ranging from a
minimum of 71% to a maximum of 98% without requiring any extensive human
annotation. In terms of end-to-end performance, GOGGLES outperforms the
state-of-the-art data programming system Snuba by 21% and a state-of-the-art
few-shot learning technique by 5%, and is only 7% away from the fully
supervised upper bound.
Comment: Published at 2020 ACM SIGMOD International Conference on Management
of Dat
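The core premise stated above (same-class pairs should have higher average affinity than different-class pairs) can be checked on toy data. This is only a sketch under assumptions: cosine similarity on hand-made feature vectors stands in for an affinity function, and it is not one of the affinity functions GOGGLES actually uses, which the abstract does not describe.

```python
import math
from itertools import combinations


def cosine_affinity(u, v):
    """Stand-in affinity function (an assumption): cosine similarity."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def mean_affinities(features, labels, affinity=cosine_affinity):
    """Return (mean same-class affinity, mean different-class affinity)."""
    same, different = [], []
    for i, j in combinations(range(len(features)), 2):
        score = affinity(features[i], features[j])
        if labels[i] == labels[j]:
            same.append(score)
        else:
            different.append(score)
    return sum(same) / len(same), sum(different) / len(different)


# Hypothetical feature vectors for four instances from two classes.
features = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
labels = [0, 0, 1, 1]

same_mean, different_mean = mean_affinities(features, labels)
premise_holds = same_mean > different_mean
```

The labels are used here only to verify the premise; in the setting the abstract describes, class inference relies on a small development set rather than full labels.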
Kernel Metric Learning for Clustering Mixed-type Data
Distance-based clustering and classification are widely used in various
fields to group mixed numeric and categorical data. A predefined distance
measurement is used to cluster data points based on their dissimilarity. While
there exist numerous distance-based measures for data with pure numerical
attributes and several ordered and unordered categorical metrics, an optimal
distance for mixed-type data is an open problem. Many metrics convert numerical
attributes to categorical ones or vice versa. They handle the data points as a
single attribute type or calculate a distance between each attribute separately
and add them up. We propose a metric that uses mixed kernels to measure
dissimilarity, with cross-validated optimal kernel bandwidths. Our approach
improves clustering accuracy when utilized for existing distance-based
clustering algorithms on simulated and real-world datasets containing pure
continuous, categorical, and mixed-type data.
Comment: 23 pages, 5 tables, 2 figure
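As a rough illustration of a kernel-based dissimilarity for mixed-type data, the sketch below multiplies a Gaussian kernel on each numeric attribute by an Aitchison-Aitken kernel on each unordered categorical attribute, then converts the kernel into a dissimilarity via k(a,a) + k(b,b) - 2k(a,b). These specific kernel choices, the conversion, and the hand-picked bandwidths are assumptions for illustration; the abstract does not give the paper's exact formula, and the bandwidths there are chosen by cross-validation rather than fixed by hand.

```python
import math


def gaussian_kernel(x, y, bandwidth):
    """Gaussian kernel for one continuous attribute."""
    return math.exp(-0.5 * ((x - y) / bandwidth) ** 2)


def aitchison_aitken_kernel(x, y, lam, n_levels):
    """Aitchison-Aitken kernel for one unordered categorical attribute."""
    return 1.0 - lam if x == y else lam / (n_levels - 1)


def mixed_kernel(a, b, bandwidths, lams, levels):
    """Product kernel over numeric and categorical attributes.

    a and b are (numeric_values, categorical_values) tuples.
    """
    num_a, cat_a = a
    num_b, cat_b = b
    value = 1.0
    for x, y, h in zip(num_a, num_b, bandwidths):
        value *= gaussian_kernel(x, y, h)
    for x, y, lam, c in zip(cat_a, cat_b, lams, levels):
        value *= aitchison_aitken_kernel(x, y, lam, c)
    return value


def kernel_dissimilarity(a, b, **params):
    """Kernel-induced dissimilarity: k(a,a) + k(b,b) - 2 k(a,b)."""
    return (mixed_kernel(a, a, **params) + mixed_kernel(b, b, **params)
            - 2.0 * mixed_kernel(a, b, **params))


# Hypothetical bandwidths; not cross-validated in this sketch.
params = dict(bandwidths=[1.0], lams=[0.2], levels=[3])
point_a = ([0.0], ["x"])
point_b = ([1.0], ["y"])

d_ab = kernel_dissimilarity(point_a, point_b, **params)
d_aa = kernel_dissimilarity(point_a, point_a, **params)
```

A dissimilarity of this kind can be fed to any distance-based clustering algorithm that accepts a precomputed dissimilarity matrix.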
Nonparametric Hierarchical Clustering of Functional Data
In this paper, we deal with the problem of curve clustering. We propose a
nonparametric method which partitions the curves into clusters and discretizes
the dimensions of the curve points into intervals. The cross-product of these
partitions forms a data-grid which is obtained using a Bayesian model selection
approach while making no assumptions regarding the curves. Finally, a
post-processing technique, aiming at reducing the number of clusters in order
to improve the interpretability of the clustering, is proposed. It consists in
optimally merging the clusters step by step, which corresponds to an
agglomerative hierarchical classification whose dissimilarity measure is the
variation of the criterion. Interestingly this measure is none other than the
sum of the Kullback-Leibler divergences between cluster distributions before
and after the merges. The practical interest of the approach for functional
data exploratory analysis is presented and compared with an alternative
approach on an artificial and a real world data set
A fuzzy taxonomy for e-Health projects
Evaluating the impact of Information Technology (IT) projects represents a problematic task for policy and decision makers aiming to define roadmaps based on previous experiences. Especially in the healthcare sector IT can support a wide range of processes and it is difficult to analyze in a comparative way the benefits and results of e-Health practices in order to define strategies and to assign priorities to potential investments. A first step towards the definition of an evaluation framework to compare e-Health initiatives consists in the definition of clusters of homogeneous projects that can be further analyzed through multiple case studies. However imprecision and subjectivity affect the classification of e-Health projects that are focused on multiple aspects of the complex healthcare system scenario. In this paper we apply a method, based on advanced cluster techniques and fuzzy theories, for validating a project taxonomy in the e-Health sector. An empirical test of the method has been performed over a set of European good practices in order to define a taxonomy for classifying e-Health projects.
Articles published in or submitted to a Journal without IF refereed / of international relevanc
Language impairment and colour categories
Goldstein (1948) reported multiple cases of failure to categorise colours in patients that he termed amnesic or anomic aphasics. These patients have a particular difficulty in producing perceptual categories in the absence of other aphasic impairments. We hold that neuropsychological evidence supports the view that the task of colour categorisation is logically impossible without labels.
Information Maximization Clustering via Multi-View Self-Labelling
Image clustering is a particularly challenging computer vision task, which
aims to generate annotations without human supervision. Recent advances focus
on the use of self-supervised learning strategies in image clustering, by first
learning valuable semantics and then clustering the image representations.
These multiple-phase algorithms, however, increase the computational time and
their final performance is reliant on the first stage. By extending the
self-supervised approach, we propose a novel single-phase clustering method
that simultaneously learns meaningful representations and assigns the
corresponding annotations. This is achieved by integrating a discrete
representation into the self-supervised paradigm through a classifier net.
Specifically, the proposed clustering objective employs mutual information, and
maximizes the dependency between the integrated discrete representation and a
discrete probability distribution. The discrete probability distribution is
derived through the self-supervised process by comparing the learnt latent
representation with a set of trainable prototypes. To enhance the learning
performance of the classifier, we jointly apply the mutual information across
multi-crop views. Our empirical results show that the proposed framework
outperforms state-of-the-art techniques with the average accuracy of 89.1% and
49.0%, respectively, on CIFAR-10 and CIFAR-100/20 datasets. Finally, the
proposed method also demonstrates attractive robustness to parameter settings,
making it readily applicable to other datasets.
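The dependency between two discrete distributions mentioned above can be measured with mutual information. The sketch below estimates it from two sets of per-sample soft assignments by averaging their outer products into a joint distribution. This is a generic estimator shown under assumptions; the abstract does not specify the paper's exact objective, and the names and toy inputs are hypothetical.

```python
import math


def mutual_information(p_rows, q_rows):
    """Mutual information (in nats) between two soft cluster assignments.

    p_rows[n] and q_rows[n] are probability vectors for sample n, e.g. a
    classifier output and a prototype-based assignment (assumed roles).
    """
    n = len(p_rows)
    k_p, k_q = len(p_rows[0]), len(q_rows[0])
    # Joint distribution: average outer product over samples.
    joint = [
        [sum(p[i] * q[j] for p, q in zip(p_rows, q_rows)) / n
         for j in range(k_q)]
        for i in range(k_p)
    ]
    marginal_p = [sum(row) for row in joint]
    marginal_q = [sum(joint[i][j] for i in range(k_p)) for j in range(k_q)]
    mi = 0.0
    for i in range(k_p):
        for j in range(k_q):
            if joint[i][j] > 0.0:
                mi += joint[i][j] * math.log(
                    joint[i][j] / (marginal_p[i] * marginal_q[j])
                )
    return mi


# Two views that agree perfectly on confident, balanced assignments.
agree = mutual_information([[1.0, 0.0], [0.0, 1.0]],
                           [[1.0, 0.0], [0.0, 1.0]])

# One view is uninformative (uniform), so the dependency vanishes.
uninformative = mutual_information([[0.5, 0.5], [0.5, 0.5]],
                                   [[1.0, 0.0], [0.0, 1.0]])
```

Maximizing a quantity like this rewards assignments that are both confident and balanced across clusters, which is why it is a common objective in information-maximization clustering.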
A Survey of Adaptive Resonance Theory Neural Network Models for Engineering Applications
This survey samples from the ever-growing family of adaptive resonance theory
(ART) neural network models used to perform the three primary machine learning
modalities, namely, unsupervised, supervised and reinforcement learning. It
comprises a representative list from classic to modern ART models, thereby
painting a general picture of the architectures developed by researchers over
the past 30 years. The learning dynamics of these ART models are briefly
described, and their distinctive characteristics such as code representation,
long-term memory and corresponding geometric interpretation are discussed.
Useful engineering properties of ART (speed, configurability, explainability,
parallelization and hardware implementation) are examined along with current
challenges. Finally, a compilation of online software libraries is provided. It
is expected that this overview will be helpful to new and seasoned ART
researchers