5 research outputs found

    Centroid-Based Clustering with ab-Divergences

    Get PDF
    Centroid-based clustering is a widely used technique within unsupervised learning algorithms in many research fields. The success of any centroid-based clustering relies on the choice of the similarity measure under use. In recent years, most studies focused on including several divergence measures in the traditional hard k-means algorithm. In this article, we consider the problem of centroid-based clustering using the family of ab-divergences, which is governed by two parameters, a and b. We propose a new iterative algorithm, ab-k-means, giving closed-form solutions for the computation of the sided centroids. The algorithm can be fine-tuned by means of this pair of values, yielding a wide range of the most frequently used divergences. Moreover, it is guaranteed to converge to local minima for a wide range of values of the pair (a, b). Our theoretical contribution has been validated by several experiments performed with synthetic and real data and exploring the (a, b) plane. The numerical results obtained confirm the quality of the algorithm and its suitability to be used in several practical applications.MINECO TEC2017-82807-

    Centroid-Based Clustering with αβ-Divergences

    Get PDF
    Article number 196Centroid-based clustering is a widely used technique within unsupervised learning algorithms in many research fields. The success of any centroid-based clustering relies on the choice of the similarity measure under use. In recent years, most studies focused on including several divergence measures in the traditional hard k-means algorithm. In this article, we consider the problem of centroid-based clustering using the family of αβ-divergences, which is governed by two parameters, α and β. We propose a new iterative algorithm, αβ-k-means, giving closed-form solutions for the computation of the sided centroids. The algorithm can be fine-tuned by means of this pair of values, yielding a wide range of the most frequently used divergences. Moreover, it is guaranteed to converge to local minima for a wide range of values of the pair (α, β). Our theoretical contribution has been validated by several experiments performed with synthetic and real data and exploring the (α, β) plane. The numerical results obtained confirm the quality of the algorithm and its suitability to be used in several practical applicationsMinisterio de Economía y Competitividad de España (MINECO) TEC2017-82807-

    Informative Data Fusion: Beyond Canonical Correlation Analysis

    Full text link
    Multi-modal data fusion is a challenging but common problem arising in fields such as economics, statistical signal processing, medical imaging, and machine learning. In such applications, we have access to multiple datasets that use different data modalities to describe some system feature. Canonical correlation analysis (CCA) is a multidimensional joint dimensionality reduction algorithm for exactly two datasets. CCA finds a linear transformation for each feature vector set such that the correlation between the two transformed feature sets is maximized. These linear transformations are easily found by solving the SVD of a matrix that only involves the covariance and cross-covariance matrices of the feature vector sets. When these covariance matrices are unknown, an empirical version of CCA substitutes sample covariance estimates formed from training data. However, when the number of training samples is less than the combined dimension of the datasets, CCA fails to reliably detect correlation between the datasets. This thesis explores the the problem of detecting correlations from data modeled by the ubiquitous signal-plus noise data model. We present a modification to CCA, which we call informative CCA (ICCA) that first projects each dataset onto a low-dimensional informative signal subspace. We verify the superior performance of ICCA on real-world datasets and argue the optimality of trim-then-fuse over fuse-then-trim correlation analysis strategies. We provide a significance test for the correlations returned by ICCA and derive improved estimates of the population canonical vectors using insights from random matrix theory. We then extend the analysis of CCA to regularized CCA (RCCA) and demonstrate that setting the regularization parameter to infinity results in the best performance and has the same solution as taking the SVD of the cross-covariance matrix of the two datasets. Finally, we apply the ideas learned from ICCA to multiset CCA (MCCA), which analyzes correlations for more than two datasets. There are multiple formulations of multiset CCA (MCCA), each using a different combination of objective function and constraint function to describe a notion of multiset correlation. We consider MAXVAR, provide an informative version of the algorithm, which we call informative MCCA (IMCCA), and demonstrate its superiority on a real-world dataset.PHDElectrical Engineering: SystemsUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/113419/1/asendorf_1.pd
    corecore