
    The Minkowski central partition as a pointer to a suitable distance exponent and consensus partitioning

    The Minkowski weighted K-means (MWK-means) is a recently developed clustering algorithm capable of computing feature weights. The cluster-specific weights in MWK-means follow the intuitive idea that a feature with low variance should have a greater weight than a feature with high variance. The final clustering found by this algorithm depends on the selection of the Minkowski distance exponent. This paper explores the possibility of using the central Minkowski partition in the ensemble of all Minkowski partitions for selecting an optimal value of the Minkowski exponent. The central Minkowski partition also appears to be a good consensus partition. Furthermore, we discovered striking correlations between the Minkowski profile, defined as a mapping of the Minkowski exponent values to the average similarity values of the optimal Minkowski partitions, and the Adjusted Rand Index vectors resulting from comparing the obtained partitions to the ground truth. Our findings were confirmed by a series of computational experiments involving synthetic Gaussian clusters and real-world data.
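    The abstract leaves the distance and similarity machinery implicit. Below is a minimal, hypothetical sketch of two ingredients it names: a cluster-specific weighted Minkowski distance and the selection of a central partition as the one with the highest average Adjusted Rand Index to the rest of the ensemble. The function names and the use of ARI as the similarity measure are assumptions for illustration, not the authors' implementation.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

def weighted_minkowski(x, y, w, p):
    """Cluster-specific weighted Minkowski distance: sum_v (w_v * |x_v - y_v|)^p."""
    return float(np.sum((w * np.abs(x - y)) ** p))

def central_partition(partitions):
    """Given one label vector per Minkowski exponent, return the index of the
    partition with the highest average similarity (ARI) to all the others,
    together with the similarity profile itself."""
    n = len(partitions)
    profile = [np.mean([adjusted_rand_score(partitions[i], partitions[j])
                        for j in range(n) if j != i]) for i in range(n)]
    return int(np.argmax(profile)), profile
```

    Under these assumptions, `profile` is one concrete reading of the Minkowski profile: one average-similarity value per exponent, with the central partition at its maximum.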

    A clustering based approach to reduce feature redundancy

    This document is the Accepted Manuscript version of the following paper: Cordeiro de Amorim, R., and Mirkin, B., ‘A clustering based approach to reduce feature redundancy’, in Proceedings, Andrzej M. J. Skulimowski and Janusz Kacprzyk, eds., Knowledge, Information and Creativity Support Systems: Recent Trends, Advances and Solutions, Selected papers from KICSS’2013 - 8th International Conference on Knowledge, Information, and Creativity Support Systems, Kraków, Poland, 7-9 November 2013. ISBN 978-3-319-19089-1, e-ISBN 978-3-319-19090-7. Available online at doi: 10.1007/978-3-319-19090-7. © Springer International Publishing Switzerland 2016. Research effort has recently focused on designing feature weighting clustering algorithms. These algorithms automatically calculate the weight of each feature in a data set, representing its degree of relevance. However, since most of these evaluate one feature at a time, they may have difficulty clustering data sets containing features with similar information. If a group of features contains the same relevant information, these clustering algorithms assign high weights to each feature in the group instead of removing some because of their redundant nature. This paper introduces an unsupervised feature selection method that can be used in the data pre-processing step to reduce the number of redundant features in a data set. This method clusters similar features together and then selects a subset of representative features for each cluster. This selection is based on the maximum information compression index between each feature and its respective cluster centroid. We present an empirical validation for our method by comparing it with a popular unsupervised feature selection method on three EEG data sets. We find that our method selects features that produce better cluster recovery, without the need for an extra user-defined parameter. Final Accepted Version.
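    The selection criterion named in the abstract, the maximum information compression index (MICI), is the smallest eigenvalue of the covariance matrix of a pair of variables, i.e. the information lost when the pair is compressed onto its first principal direction. The sketch below is a hypothetical rendering of the pipeline as described: cluster the features, then keep one representative per cluster. The use of k-means over feature columns and the choice of the feature with the smallest MICI to its centroid are assumptions, not the authors' exact procedure.

```python
import numpy as np
from sklearn.cluster import KMeans

def mici(x, y):
    """Maximum information compression index of two variables: the smallest
    eigenvalue of their 2x2 covariance matrix (0 when linearly dependent)."""
    return float(np.linalg.eigvalsh(np.cov(np.vstack([x, y])))[0])

def select_representative_features(X, n_feature_clusters):
    """Cluster the columns (features) of X, then keep, per cluster, the feature
    whose MICI to the cluster centroid is smallest, i.e. the feature that loses
    the least information when standing in for the centroid."""
    km = KMeans(n_clusters=n_feature_clusters, n_init=10).fit(X.T)
    selected = []
    for c in range(n_feature_clusters):
        members = np.where(km.labels_ == c)[0]
        centroid = km.cluster_centers_[c]
        best = members[np.argmin([mici(X[:, f], centroid) for f in members])]
        selected.append(int(best))
    return sorted(selected)
```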

    The Training Process of Many Deep Networks Explores the Same Low-Dimensional Manifold

    We develop information-geometric techniques to analyze the trajectories of the predictions of deep networks during training. By examining the underlying high-dimensional probabilistic models, we reveal that the training process explores an effectively low-dimensional manifold. Networks with a wide range of architectures and sizes, trained using different optimization methods, regularization techniques, data augmentation techniques, and weight initializations, lie on the same manifold in the prediction space. We study the details of this manifold and find that networks with different architectures follow distinguishable trajectories, while other factors have a minimal influence; larger networks train along a similar manifold as smaller networks, just faster; and networks initialized at very different parts of the prediction space converge to the solution along a similar manifold.
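    The analysis operates on the probability distributions a network predicts at successive checkpoints rather than on its weights. A minimal, hypothetical way to reproduce the flavour of this is to record softmax outputs on a fixed probe set at each checkpoint, measure pairwise distances between those prediction distributions, and embed the distance matrix in a few dimensions. The Bhattacharyya distance and classical MDS below are stand-ins for the paper's information-geometric machinery, chosen only to keep the sketch short.

```python
import numpy as np
from sklearn.manifold import MDS

def bhattacharyya(p, q, eps=1e-12):
    """Average per-sample Bhattacharyya distance between two arrays of predicted
    class probabilities, each of shape (n_samples, n_classes)."""
    bc = np.sum(np.sqrt((p + eps) * (q + eps)), axis=1)
    return float(np.mean(-np.log(bc)))

def embed_prediction_trajectory(checkpoints, dim=2):
    """checkpoints: list of (n_samples, n_classes) softmax outputs saved during
    training (possibly for several networks concatenated). Returns a low-dimensional
    embedding whose rows trace the trajectories in prediction space."""
    n = len(checkpoints)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            D[i, j] = D[j, i] = bhattacharyya(checkpoints[i], checkpoints[j])
    return MDS(n_components=dim, dissimilarity="precomputed").fit_transform(D)
```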

    Removing redundant features via clustering: preliminary results in mental task separation

    Recent clustering algorithms have been designed to take into account the degree of relevance of each feature by automatically calculating feature weights. However, as the tendency is to evaluate one feature at a time, these algorithms may have difficulty dealing with features containing similar information. Should this information be relevant, these algorithms would assign high weights to all such features instead of removing some due to their redundant nature. In this paper we introduce an unsupervised feature selection method that targets redundant features. Our method clusters similar features together and selects a subset of representative features for each cluster. This selection is based on the maximum information compression index between each feature and its respective cluster centroid. We empirically validate our method by comparing it with a popular unsupervised feature selection method on three EEG data sets. We find that our method selects features that produce better cluster recovery, without the need for an extra user-defined parameter. Final Accepted Version.

    Fast Algorithms for Constructing Maximum Entropy Summary Trees

    Karloff and Shirley recently proposed summary trees as a new way to visualize large rooted trees (Eurovis 2013) and gave algorithms for generating a maximum-entropy k-node summary tree of an input n-node rooted tree. However, the algorithm generating optimal summary trees was only pseudo-polynomial (and worked only for integral weights); the authors left open the existence of a polynomial-time algorithm. In addition, the authors provided an additive approximation algorithm and a greedy heuristic, both working on real weights. This paper shows how to construct maximum-entropy k-node summary trees in time O(k^2 n + n log n) for real weights (indeed, as small as the time bound for the greedy heuristic given previously); how to speed up the approximation algorithm so that it runs in time O(n + (k^4/eps) log(k/eps)); and how to speed up the greedy algorithm so as to run in time O(kn + n log n). Altogether, these results make summary trees a much more practical tool than before. Comment: 17 pages, 4 figures. Extended version of a paper appearing in ICALP 2014.
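    The objective these algorithms optimise is the Shannon entropy of the weight distribution induced by a candidate k-node summary: each summary node absorbs some total weight from the original tree, and a summary is more informative when that weight is spread evenly. A tiny illustrative helper, not taken from the paper, makes the objective concrete:

```python
import math

def summary_entropy(node_weights):
    """Entropy (in bits) of a candidate summary tree, where node_weights[i] is the
    total weight of the original nodes absorbed by summary node i. The algorithms
    above search, over all valid k-node summaries, for the one maximising this value."""
    total = float(sum(node_weights))
    probs = [w / total for w in node_weights if w > 0]
    return -sum(p * math.log2(p) for p in probs)

# e.g. a 3-node summary that splits a tree of total weight 8 into parts 4, 3, 1
print(summary_entropy([4, 3, 1]))  # ~1.41 bits
```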

    Recovering the number of clusters in data sets with noise features using feature rescaling factors

    In this paper we introduce three methods for re-scaling data sets, aiming to improve the likelihood that clustering validity indexes return the true number of spherical Gaussian clusters in the presence of additional noise features. Our method obtains feature re-scaling factors taking into account the structure of a given data set and the intuitive idea that different features may have different degrees of relevance at different clusters. We experiment with the Silhouette (using squared Euclidean, Manhattan, and the pth power of the Minkowski distance), Dunn's, Calinski–Harabasz and Hartigan indexes on data sets with spherical Gaussian clusters with and without noise features. We conclude that our methods indeed increase the chances of estimating the true number of clusters in a data set. Peer reviewed.
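    The re-scaling factors themselves are the paper's contribution and are not reproduced here. What the sketch below shows, under the assumption that the data matrix has already been re-scaled, is the downstream step the abstract refers to: sweeping candidate numbers of clusters and letting a validity index (here the Silhouette under a Minkowski metric) pick the best k. Function and parameter names are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def estimate_number_of_clusters(X_rescaled, k_range=range(2, 11), p=2.0):
    """Cluster the (already re-scaled) data for each candidate k and return the k
    with the highest Silhouette score under the Minkowski metric of exponent p."""
    best_k, best_score = None, -np.inf
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=10).fit_predict(X_rescaled)
        score = silhouette_score(X_rescaled, labels, metric="minkowski", p=p)
        if score > best_score:
            best_k, best_score = k, score
    return best_k
```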

    Effective Spell Checking Methods Using Clustering Algorithms

    This paper presents a novel approach to spell checking using dictionary clustering. The main goal is to reduce the number of times distances have to be calculated when finding target words for misspellings. The method is unsupervised and combines the application of anomalous pattern initialization and partitioning around medoids (PAM). To evaluate the method, we used an English misspelling list compiled from real examples extracted from the Birkbeck spelling error corpus. Final Published version.
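    The speed-up comes from restricting the distance calculations for a misspelling to a single dictionary cluster instead of the whole dictionary. The sketch below assumes the PAM step has already produced a list of medoid words and their clusters (the anomalous pattern initialization used to seed PAM is not shown) and uses plain Levenshtein distance; both choices are illustrative rather than the paper's exact configuration.

```python
def levenshtein(a, b):
    """Standard edit distance by dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def suggest(misspelling, medoids, clusters):
    """medoids: one representative word per cluster; clusters: list of word lists.
    Only the nearest medoid's cluster is scanned, so far fewer distances are
    computed than a full pass over the dictionary would need."""
    nearest = min(range(len(medoids)), key=lambda i: levenshtein(misspelling, medoids[i]))
    return min(clusters[nearest], key=lambda w: levenshtein(misspelling, w))
```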

    Core clustering as a tool for tackling noise in cluster labels

    Real-world data sets often contain mislabelled entities. This can be particularly problematic if the data set is being used by a supervised classification algorithm during its learning phase, in which case the accuracy of the classifier, when applied to unlabelled data, is likely to suffer considerably. In this paper we introduce a clustering-based method capable of reducing the number of mislabelled entities in data sets. Our method can be summarised as follows: (i) cluster the data set; (ii) select the entities that have the most potential to be assigned to correct clusters; (iii) use the entities of the previous step to define the core clusters and map them to the labels using a confusion matrix; (iv) use the core clusters and our cluster membership criterion to correct the labels of the remaining entities. We perform numerous experiments to validate our method empirically, using k-nearest neighbour classifiers as a benchmark. We experiment with both synthetic and real-world data sets with different proportions of mislabelled entities. Our experiments demonstrate that the proposed method produces promising results; thus, it could be used as a pre-processing data-correction step of a supervised machine learning algorithm.
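    The four steps lend themselves to a compact sketch. The version below is a hypothetical simplification: confidence in an assignment is approximated by distance to the assigned centroid, the confusion-matrix mapping is collapsed to a majority vote of noisy labels among the core members, and k-means stands in for whatever clustering the authors actually used.

```python
import numpy as np
from sklearn.cluster import KMeans

def correct_labels(X, noisy_labels, n_clusters, core_fraction=0.5):
    """(i) cluster; (ii) keep the entities closest to their centroid as the core;
    (iii) map each core cluster to its majority noisy label; (iv) relabel every
    entity with the label mapped to its cluster."""
    noisy = np.asarray(noisy_labels)
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(X)
    dist = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
    mapping = {}
    for c in range(n_clusters):
        members = np.where(km.labels_ == c)[0]
        core_size = max(1, int(core_fraction * len(members)))
        core = members[np.argsort(dist[members])[:core_size]]
        values, counts = np.unique(noisy[core], return_counts=True)
        mapping[c] = values[np.argmax(counts)]
    return np.array([mapping[c] for c in km.labels_])
```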