7 research outputs found

    Algorithms for additive overlapping clustering.

    No full text
    In statistical practice, one is often interested in finding structures within a given group of I objects. In several cases, such structures may be based on a number of subgroups of the objects under study. To reveal subgroups in a group of objects, one may appeal to some form of cluster analysis. Several clustering models exist, depending on the type of data and the type of clustering. In this dissertation, all clustering models imply an overlapping clustering of the objects. Furthermore, all models imply an approximation of the given data and, in the associated data analysis, are fitted to a given data set by means of minimizing a least squares loss function.

    In Chapter 1, three-way two-mode I x I x J object by object by source data blocks are considered, consisting of J object by object similarity matrices of size I x I stemming from J different sources. A well-known clustering model for this type of data is the INDCLUS model, which implies an overlapping clustering of the objects, with each cluster having a slice-specific positive weight. In the data analysis, the minimization of the least squares loss function has appeared to be a difficult problem, for which several algorithmic strategies have been proposed. At present, the best available option seems to be the SYMPRES algorithm. Yet, SYMPRES is conjectured to suffer from a severe local optima problem. As a way out, we will propose four new ALS algorithms. In a simulation study, one of the new algorithms is shown to outperform SYMPRES in terms of optimization performance, cluster recovery, and computational effort.

    In Chapter 2, two-way two-mode I x J object by variable data will be considered. The one-mode additive clustering model implies an overlapping clustering of the objects and a set of profiles (characteristic variable values for each cluster). The model values of the variables of an object are the sum of the profiles of its corresponding clusters.
    In the data analysis, a sequential fitting strategy, also called the method of principal clusters (PCL), has been proposed for the minimization of the least squares loss function. Theoretical and empirical evidence that the PCL algorithm may have problems in revealing the true structure underlying a data set will be presented. As a way out, three new algorithms to fit the model to empirical data will be presented: two of an alternating least squares (ALS) type, orthogonally combined with two different starting strategies, and one based on simulated annealing (SA). In a simulation study it is demonstrated that all three new algorithms outperform the existing PCL algorithm. The number of objects that belong to more than one cluster (the overlap) is further found to have a considerable influence on the algorithmic performance of the ALS algorithms, with low amounts of overlap requiring a different starting strategy than high ones. As a consequence, for the analysis of real data sets in practice, a hybrid approach will be presented, consisting of one of the ALS algorithms initialized by means of the two starting strategies under study.

    The profiles of the one-mode additive clustering model are the key to interpreting the clusters. The interpretation may, however, be seriously hampered in case the data include a very large number of variables. To deal with this problem, we propose in Chapter 3 a new model that simultaneously clusters the objects in overlapping clusters and reduces the variable space; as such, the model implies that the cluster profiles and, hence, the reconstructed data are constrained to lie in a low-dimensional space. An ALS algorithm to fit the new model to a given data set will be presented, along with a simulation study and an illustrative example that makes use of empirical data.

    In Chapter 4, we will consider the additive biclustering model for two-way two-mode object by variable data.
    This model implies an overlapping clustering of both the objects and the variables, together with a core value for each bicluster (i.e., a pair of an object and a variable cluster). The value of a model entry then equals the sum of the core values of the biclusters the entry belongs to. To minimize the least squares loss function, two algorithms have been proposed in the literature: PENCLUS and Baier's ALS approach. However, up to now, these algorithms have not yet been systematically evaluated in a simulation study. Moreover, both algorithms have some built-in limitations, the impact of which on their performance is unclear. As a way out, first, a new ALS algorithm will be presented. Second, we will report on a simulation study in which the two existing and the new algorithm are evaluated. From this study, the new ALS algorithm appears to outperform the two existing strategies.

    Little is yet known about the identifiability of the additive clustering models in question, which may hamper their interpretability. Therefore, in Chapter 5, we will first present a necessary condition on the clustering structure for the one-mode additive clustering model to be identified. This condition is further shown to imply that hierarchies are never identified. Second, we will show that the inclusion of an offset term in one-mode additive clustering models implies an unidentifiability that can only be resolved by adding constraints to the model.
    In the discussion, we show that the necessary condition for identifiability can be extended to the additive biclustering model, whereas it is not necessary for the identifiability of the ADCLUS/INDCLUS model for two-way one-mode similarities and for Mirkin's additive boxclustering model.

    Table of contents:
    Introduction
    1 New ALS approaches for fitting the INDCLUS model
        1.1 Introduction
        1.2 The SYMPRES algorithm
        1.3 Four new ALS approaches
        1.4 Simulation study
            1.4.1 Introduction
            1.4.2 Design
            1.4.3 Results
            1.4.4 Discussion
        1.5 Concluding remarks
        References
    2 Algorithms for additive clustering of rectangular data tables
        2.1 Introduction
        2.2 The One-mode Additive Clustering Model
            2.2.1 The model
            2.2.2 Illustrative Application
            2.2.3 Data Analysis: Loss functions
        2.3 Principal Cluster Analysis
            2.3.1 PCL Algorithm
            2.3.2 Some problems with PCL
        2.4 Three new algorithms
            2.4.1 ALS-lf2
            2.4.2 ALS-lf1
            2.4.3 SA-lf1
        2.5 Simulation Study
            2.5.1 Introduction
            2.5.2 Data generation
            2.5.3 Algorithms
            2.5.4 Evaluation Criteria
            2.5.5 Results
            2.5.6 Discussion
        2.6 Concluding remarks
        References
        2.A Pseudocodes
            2.A.1 PCL pseudocode
            2.A.2 ALS-lf2 Pseudocode
            2.A.3 ALS-lf1 Pseudocode
            2.A.4 Simulated Annealing Pseudocode
    3 Lowdimensional overlapping clustering
        3.1 Introduction
        3.2 The Low Dimensional Overlapping Clustering Model
        3.3 Estimation
            3.3.1 Loss Function
            3.3.2 Algorithm
        3.4 Simulation Study
            3.4.1 Introduction
            3.4.2 Design
            3.4.3 Evaluation Criteria
            3.4.4 Results and Discussion
        3.5 Application
        3.6 Discussion
        References
        3.A Descriptions of the situations
    4 Additive biclustering: A comparison of one new and two existing algorithms
        4.1 Introduction
        4.2 Additive biclustering model
        4.3 Data Analysis
            4.3.1 PENCLUS
            4.3.2 Clusterwise ALS
            4.3.3 A novel algorithm: Full clustering ALS
        4.4 Simulation Study
            4.4.1 Introduction
            4.4.2 Design
            4.4.3 Evaluation Criteria
            4.4.4 Results
            4.4.5 Discussion
        4.5 Conclusion
        References
    5 Uniqueness of Additive Clustering Models
        5.1 Introduction
        5.2 Additive Clustering Models
        5.3 Identifiability
            5.3.1 General Issues
            5.3.2 Offset terms
        5.4 Discussion
            5.4.1 Mirkin's Binary Hierarchy model
            5.4.2 The ADCLUS and INDCLUS models for similarity data
            5.4.3 The box clustering model
            5.4.4 The additive biclustering model
        References
        5.A Counterexample ADCLUS
        5.B Counterexample additive boxclustering model
    Discussion
    Nederlandstalige samenvatting (Dutch summary)
    Status: published
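    The additive model for object by variable data that recurs throughout the abstract (from Chapter 2 onwards) can be stated concretely. Below is a minimal sketch with hypothetical numbers; the names `P` (binary object-by-cluster membership matrix) and `Q` (cluster-by-variable profile matrix) are my notation, not taken from the dissertation:

```python
import numpy as np

# Hypothetical example: 4 objects, 2 overlapping clusters, 3 variables.
# P[i, k] = 1 if object i belongs to cluster k; a row with several 1s
# is an object in the overlap. Q[k, :] is the profile of cluster k.
P = np.array([[1, 0],
              [1, 1],   # object 1 belongs to both clusters
              [0, 1],
              [1, 0]])
Q = np.array([[2.0, 0.0, 1.0],
              [1.0, 3.0, 0.0]])

# Model values: each object's variables are the sum of the profiles
# of the clusters it belongs to.
M = P @ Q

# The least squares loss that the chapters minimize, for a data matrix X:
def loss(X, P, Q):
    return np.sum((X - P @ Q) ** 2)
```

    For object 1, a member of both clusters, the model row is the element-wise sum of the two profiles.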

    Lowdimensional additive overlapping clustering

    No full text
    To reveal the structure underlying two-way two-mode object by variable data, Mirkin (1987) has proposed an additive overlapping clustering model. This model implies an overlapping clustering of the objects and a reconstruction of the data, with the reconstructed variable profile of an object being a summation of the variable profiles of the clusters it belongs to. Grasping the additive (overlapping) clustering structure of object by variable data may, however, be seriously hampered in case the data include a very large number of variables. To deal with this problem, we propose a new model that simultaneously clusters the objects in overlapping clusters and reduces the variable space; as such, the model implies that the cluster profiles and, hence, the reconstructed data profiles are constrained to lie in a low-dimensional space. An alternating least squares (ALS) algorithm to fit the new model to a given data set will be presented, along with a simulation study and an illustrative example that makes use of empirical data.
    Status: published
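    The low-dimensional constraint can be sketched as follows. This is a toy illustration under my own notation (`P` memberships, `B` cluster scores, `V` variable loadings), not the dissertation's estimation code:

```python
import numpy as np

rng = np.random.default_rng(0)
I, J, K, S = 6, 50, 3, 2   # objects, variables, clusters, dimensions (S << J)

P = (rng.random((I, K)) < 0.5).astype(float)   # overlapping memberships
B = rng.standard_normal((K, S))                # cluster scores in S dimensions
V = rng.standard_normal((J, S))                # variable loadings

# Constrained profiles: every cluster profile lies in an S-dimensional
# subspace of the J-dimensional variable space, so Q has rank at most S.
Q = B @ V.T      # K x J profile matrix
M = P @ Q        # reconstructed I x J data
```

    Because Q factors through the S-dimensional space, the clusters can be interpreted via the K x S scores and J x S loadings instead of K full-length profiles.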

    Algorithms for additive clustering of rectangular data tables

    No full text
    The overlapping additive clustering model or principal cluster model is a model for two-way two-mode object by variable data that implies an overlapping clustering of the objects and a set of profiles (characteristic variable values for each cluster). The model values of the variables of an object are the sum of the profiles of its corresponding clusters. In the associated data analysis, the data matrix at hand is approximated by an overlapping additive clustering model of a prespecified rank by minimizing a least squares loss function. Recently an algorithm has been proposed for this purpose. This algorithm is a sequential fitting strategy, also called the method of principal clusters (PCL). Theoretical and empirical evidence that the PCL algorithm may have problems in revealing the true structure underlying a data set will be presented. As a way out, three new algorithms to fit the principal cluster model to empirical data will be presented: two of an alternating least squares (ALS) type, orthogonally combined with two different starting strategies, and one based on simulated annealing (SA). In a simulation study it is demonstrated that all three new algorithms outperform the existing PCL algorithm. The number of objects that belong to more than one cluster (the overlap) is further found to have a considerable influence on the algorithmic performance of the ALS algorithms, with low amounts of overlap requiring a different starting strategy than high ones. As a consequence, for the analysis of real data sets in practice, a hybrid approach will be presented, consisting of one of the ALS algorithms initialized by means of the two starting strategies under study.
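    A generic ALS scheme for this model can be sketched as below. This is not the paper's ALS-lf1 or ALS-lf2 algorithm (whose details are not given here); it just illustrates the alternating idea: a closed-form least squares update of the profiles, then an exhaustive update of each object's binary membership row:

```python
import numpy as np
from itertools import product

def als_additive_clustering(X, K, n_iter=50, seed=0):
    """Minimal ALS sketch: alternate conditionally optimal updates of the
    binary memberships P and the real-valued profiles Q. Illustrative only."""
    rng = np.random.default_rng(seed)
    I, J = X.shape
    P = (rng.random((I, K)) < 0.5).astype(float)   # random binary start
    patterns = np.array(list(product([0.0, 1.0], repeat=K)))  # all 2^K rows
    for _ in range(n_iter):
        # Profile step: least squares solution for Q given the memberships.
        Q, *_ = np.linalg.lstsq(P, X, rcond=None)
        # Membership step: for each object, pick the binary pattern whose
        # implied row (sum of the selected profiles) fits its data row best.
        candidates = patterns @ Q                  # 2^K candidate model rows
        for i in range(I):
            errors = np.sum((X[i] - candidates) ** 2, axis=1)
            P[i] = patterns[np.argmin(errors)]
    return P, Q
```

    Enumerating all 2^K membership patterns is only feasible for a small number of clusters, but it makes each membership update conditionally optimal, which is the property ALS schemes rely on; the starting strategy matters because such schemes only guarantee a local optimum.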

    Additive biclustering: A comparison of one new and two existing ALS algorithms

    No full text
    The additive biclustering model for two-way two-mode object by variable data implies overlapping clusterings of both the objects and the variables, together with a weight for each bicluster (i.e., a pair of an object and a variable cluster). In the data analysis, an additive biclustering model is fitted to given data by means of minimizing a least squares loss function. To this end, two alternating least squares (ALS) algorithms may be used: (1) PENCLUS, and (2) Baier's ALS approach. However, both algorithms suffer from some inherent limitations, which may hamper their performance. As a way out, based on theoretical results regarding optimally designing ALS algorithms, in this paper a new ALS algorithm will be presented. In a simulation study this algorithm will be shown to outperform the existing ALS approaches.
    Status: published
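    The reconstruction rule of the additive biclustering model is compact enough to show directly. A toy sketch with hypothetical numbers; `P`, `Q`, and `C` are my notation for object memberships, variable memberships, and bicluster weights (core values):

```python
import numpy as np

# 3 objects, 4 variables, 2 object clusters, 2 variable clusters.
P = np.array([[1, 0],
              [1, 1],          # object 1 is in both object clusters
              [0, 1]])
Q = np.array([[1, 0],
              [1, 0],
              [0, 1],
              [1, 1]])         # variable 3 is in both variable clusters
C = np.array([[2.0, 1.0],      # C[k, l]: weight of bicluster (k, l)
              [0.0, 3.0]])

# Each model entry is the sum of the weights of all biclusters
# (object cluster, variable cluster) that the entry belongs to.
M = P @ C @ Q.T
```

    Entry (1, 3) belongs to all four biclusters, so its model value is the sum of all four weights: 2 + 1 + 0 + 3 = 6.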

    Block-relaxation approaches for fitting the INDCLUS model

    No full text
    A well-known clustering model to represent I x I x J data blocks, the J frontal slices of which consist of I x I object by object similarity matrices, is the INDCLUS model. This model implies a grouping of the I objects into a prespecified number of overlapping clusters, with each cluster having a slice-specific positive weight. An INDCLUS model is fitted to a given data set by means of minimizing a least squares loss function. The minimization of this loss function has appeared to be a difficult problem for which several algorithmic strategies have been proposed. At present, the best available option seems to be the SYMPRES algorithm, which minimizes the loss function by means of a block-relaxation algorithm. Yet, SYMPRES is conjectured to suffer from a severe local optima problem. As a way out, based on theoretical results with respect to optimally designing block-relaxation algorithms, five alternative block-relaxation algorithms are proposed. In a simulation study it appears that the alternative algorithms with overlapping parameter subsets perform best and clearly outperform SYMPRES in terms of optimization performance and cluster recovery.
    Status: published
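    The per-slice INDCLUS reconstruction can be sketched as follows. A minimal illustration with hypothetical numbers; note that the classical INDCLUS formulation also carries a slice-specific additive constant c_j, which this abstract does not mention and which I include here as an assumption:

```python
import numpy as np

# 4 objects, 2 overlapping clusters, 2 sources (frontal slices).
P = np.array([[1, 0],
              [1, 1],
              [0, 1],
              [1, 0]])
weights = [np.diag([1.0, 2.0]), np.diag([0.5, 0.3])]  # positive W_j per slice
consts = [0.1, 0.2]                                   # additive constant c_j

def indclus_reconstruct(P, W, c):
    """Model similarity matrix for one slice: P W P' + c 11'."""
    I = P.shape[0]
    return P @ W @ P.T + c * np.ones((I, I))

models = [indclus_reconstruct(P, W, c) for W, c in zip(weights, consts)]

def total_loss(slices, P, weights, consts):
    """Least squares loss summed over the J slices."""
    return sum(np.sum((S - indclus_reconstruct(P, W, c)) ** 2)
               for S, W, c in zip(slices, weights, consts))
```

    The common membership matrix P is shared across slices while the weights differ per source, which is what makes every parameter update depend on all slices at once and the loss hard to minimize.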