376 research outputs found

    BROCCOLI: overlapping and outlier-robust biclustering through proximal stochastic gradient descent

    Get PDF
    Matrix tri-factorization subject to binary constraints is a versatile and powerful framework for the simultaneous clustering of observations and features, also known as biclustering. Applications for biclustering encompass the clustering of high-dimensional data and explorative data mining, where the selection of the most important features is relevant. Unfortunately, due to the lack of suitable methods for the optimization subject to binary constraints, the powerful framework of biclustering is typically constrained to clusterings which partition the set of observations or features. As a result, overlap between clusters cannot be modelled and every item, even outliers in the data, have to be assigned to exactly one cluster. In this paper we propose Broccoli, an optimization scheme for matrix factorization subject to binary constraints, which is based on the theoretically well-founded optimization scheme of proximal stochastic gradient descent. Thereby, we do not impose any restrictions on the obtained clusters. Our experimental evaluation, performed on both synthetic and real-world data, and against 6 competitor algorithms, show reliable and competitive performance, even in presence of a high amount of noise in the data. Moreover, a qualitative analysis of the identified clusters shows that Broccoli may provide meaningful and interpretable clustering structures

    Multi-View Cluster Analysis With Incomplete Data to Understand Treatment Effects

    Get PDF
    Multi-view cluster analysis, as a popular granular computing method, aims to partition sample subjects into consistent clusters across different views in which the subjects are characterized. Frequently, data entries can be missing from some of the views. The latest multi-view co-clustering methods cannot effectively deal with incomplete data, especially when there are mixed patterns of missing values. We propose an enhanced formulation for a family of multi-view co-clustering methods to cope with the missing data problem by introducing an indicator matrix whose elements indicate which data entries are observed and assessing cluster validity only on observed entries. In comparison with the simple strategy of removing subjects with missing values, our approach can use all available data in cluster analysis. In comparison with common methods that impute missing data in order to use regular multi-view analytics, our approach is less sensitive to imputation uncertainty. In comparison with other state-of-the-art multi-view incomplete clustering methods, our approach is sensible in the cases of missing any value in a view or missing the entire view, the most common scenario in practice. We first validated the proposed strategy in simulations, and then applied it to a treatment study of heroin dependence which would have been impossible with previous methods due to a number of missing-data patterns. Patients in a treatment study were naturally assessed in different feature spaces such as in the pre-, during-and post-treatment time windows. Our algorithm was able to identify subgroups where patients in each group showed similarities in all of the three time windows, thus leading to the recognition of pre-treatment (baseline) features predictive of post-treatment outcomes
    • …
    corecore