51 research outputs found

    A Survey on Soft Subspace Clustering

    Full text link
    Subspace clustering (SC) is a promising clustering technology to identify clusters based on their associations with subspaces in high dimensional spaces. SC can be classified into hard subspace clustering (HSC) and soft subspace clustering (SSC). While HSC algorithms have been extensively studied and well accepted by the scientific community, SSC algorithms are relatively new but gaining more attention in recent years due to better adaptability. In the paper, a comprehensive survey on existing SSC algorithms and the recent development are presented. The SSC algorithms are classified systematically into three main categories, namely, conventional SSC (CSSC), independent SSC (ISSC) and extended SSC (XSSC). The characteristics of these algorithms are highlighted and the potential future development of SSC is also discussed.Comment: This paper has been published in Information Sciences Journal in 201

    Sparsity-Inducing Fuzzy Subspace Clustering

    Get PDF
    This paper considers a fuzzy subspace clustering problem and proposes to introduce an original sparsity-inducing regularization term. The minimization of this term, which involves a l0_{0} penalty, is considered from a geometric point of view and a novel proximal operator is derived. A subspace clustering algorithm, Prosecco, is proposed to optimize the cost function using both proximal and alternate gradient descent. Experiments comparing this algorithm to the state of the art in sparse fuzzy subspace clustering show the relevance of the proposed approach

    Rough Fuzzy Subspace Clustering for Data with Missing Values

    Get PDF
    The paper presents rough fuzzy subspace clustering algorithm and experimental results of clustering. In this algorithm three approaches for handling missing values are used: marginalisation, imputation and rough sets. The algorithm also assigns weights to attributes in each cluster; this leads to subspace clustering. The parameters of clusters are elaborated in the iterative procedure based on minimising of criterion function. The crucial parameter of the proposed algorithm is the parameter having the influence on the sharpness of elaborated subspace cluster. The lower values of the parameter lead to selection of the most important attribute. The higher values create clusters in the global space, not in subspaces. The paper is accompanied by results of clustering of synthetic and real life data sets

    Enhanced clustering analysis pipeline for performance analysis of parallel applications

    Get PDF
    Clustering analysis is widely used to stratify data in the same cluster when they are similar according to the specific metrics. We can use the cluster analysis to group the CPU burst of a parallel application, and the regions on each process in-between communication calls or calls to the parallel runtime. The resulting clusters obtained are the different computational trends or phases that appear in the application. These clusters are useful to understand the behavior of the computation part of the application and focus the analyses on those that present performance issues. Although density-based clustering algorithms are a powerful and efficient tool to summarize this type of information, their traditional user-guided clustering methodology has many shortcomings and deficiencies in dealing with the complexity of data, the diversity of data structures, high-dimensionality of data, and the dramatic increase in the amount of data. Consequently, the majority of DBSCAN-like algorithms have weaknesses to handle high-dimensionality and/or Multi-density data, and they are sensitive to their hyper-parameter configuration. Furthermore, extracting insight from the obtained clusters is an intuitive and manual task. To mitigate these weaknesses, we have proposed a new unified approach to replace the user-guided clustering with an automated clustering analysis pipeline, called Enhanced Cluster Identification and Interpretation (ECII) pipeline. To build the pipeline, we propose novel techniques including Robust Independent Feature Selection, Feature Space Curvature Map, Organization Component Analysis, and hyper-parameters tuning to feature selection, density homogenization, cluster interpretation, and model selection which are the main components of our machine learning pipeline. This thesis contributes four new techniques to the Machine Learning field with a particular use case in Performance Analytics field. The first contribution is a novel unsupervised approach for feature selection on noisy data, called Robust Independent Feature Selection (RIFS). Specifically, we choose a feature subset that contains most of the underlying information, using the same criteria as the Independent component analysis. Simultaneously, the noise is separated as an independent component. The second contribution of the thesis is a parametric multilinear transformation method to homogenize cluster densities while preserving the topological structure of the dataset, called Feature Space Curvature Map (FSCM). We present a new Gravitational Self-organizing Map to model the feature space curvature by plugging the concepts of gravity and fabric of space into the Self-organizing Map algorithm to mathematically describe the density structure of the data. To homogenize the cluster density, we introduce a novel mapping mechanism to project the data from the non-Euclidean curved space to a new Euclidean flat space. The third contribution is a novel topological-based method to study potentially complex high-dimensional categorized data by quantifying their shapes and extracting fine-grain insights from them to interpret the clustering result. We introduce our Organization Component Analysis (OCA) method for the automatic arbitrary cluster-shape study without an assumption about the data distribution. Finally, to tune the DBSCAN hyper-parameters, we propose a new tuning mechanism by combining techniques from machine learning and optimization domains, and we embed it in the ECII pipeline. Using this cluster analysis pipeline with the CPU burst data of a parallel application, we provide the developer/analyst with a high-quality SPMD computation structure detection with the added value that reflects the fine grain of the computation regions.El análisis de conglomerados se usa ampliamente para estratificar datos en el mismo conglomerado cuando son similares según las métricas específicas. Nosotros puede usar el análisis de clúster para agrupar la ráfaga de CPU de una aplicación paralela y las regiones en cada proceso intermedio llamadas de comunicación o llamadas al tiempo de ejecución paralelo. Los clusters resultantes obtenidos son las diferentes tendencias computacionales o fases que aparecen en la solicitud. Estos clusters son útiles para entender el comportamiento de la parte de computación del aplicación y centrar los análisis en aquellos que presenten problemas de rendimiento. Aunque los algoritmos de agrupamiento basados en la densidad son una herramienta poderosa y eficiente para resumir este tipo de información, su La metodología tradicional de agrupación en clústeres guiada por el usuario tiene muchas deficiencias y deficiencias al tratar con la complejidad de los datos, la diversidad de estructuras de datos, la alta dimensionalidad de los datos y el aumento dramático en la cantidad de datos. En consecuencia, el La mayoría de los algoritmos similares a DBSCAN tienen debilidades para manejar datos de alta dimensionalidad y/o densidad múltiple, y son sensibles a su configuración de hiperparámetros. Además, extraer información de los clústeres obtenidos es una forma intuitiva y tarea manual Para mitigar estas debilidades, hemos propuesto un nuevo enfoque unificado para reemplazar el agrupamiento guiado por el usuario con un canalización de análisis de agrupamiento automatizado, llamada canalización de identificación e interpretación de clúster mejorada (ECII). para construir el tubería, proponemos técnicas novedosas que incluyen la selección robusta de características independientes, el mapa de curvatura del espacio de características, Análisis de componentes de la organización y ajuste de hiperparámetros para la selección de características, homogeneización de densidad, agrupación interpretación y selección de modelos, que son los componentes principales de nuestra canalización de aprendizaje automático. Esta tesis aporta cuatro nuevas técnicas al campo de Machine Learning con un caso de uso particular en el campo de Performance Analytics. La primera contribución es un enfoque novedoso no supervisado para la selección de características en datos ruidosos, llamado Robust Independent Feature. Selección (RIFS).Específicamente, elegimos un subconjunto de funciones que contiene la mayor parte de la información subyacente, utilizando el mismo criterios como el análisis de componentes independientes. Simultáneamente, el ruido se separa como un componente independiente. La segunda contribución de la tesis es un método de transformación multilineal paramétrica para homogeneizar densidades de clústeres mientras preservando la estructura topológica del conjunto de datos, llamado Mapa de Curvatura del Espacio de Características (FSCM). Presentamos un nuevo Gravitacional Mapa autoorganizado para modelar la curvatura del espacio característico conectando los conceptos de gravedad y estructura del espacio en el Algoritmo de mapa autoorganizado para describir matemáticamente la estructura de densidad de los datos. Para homogeneizar la densidad del racimo, introducimos un mecanismo de mapeo novedoso para proyectar los datos del espacio curvo no euclidiano a un nuevo plano euclidiano espacio. La tercera contribución es un nuevo método basado en topología para estudiar datos categorizados de alta dimensión potencialmente complejos mediante cuantificando sus formas y extrayendo información detallada de ellas para interpretar el resultado de la agrupación. presentamos nuestro Método de análisis de componentes de organización (OCA) para el estudio automático de forma arbitraria de conglomerados sin una suposición sobre el distribución de datos.Postprint (published version

    Topological data analysis of organoids

    Get PDF
    Organoids are multi-cellular structures which are cultured in vitro from stem cells to resemble specific organs (e.g., colon, liver) in their three- dimensional composition. The gene expression and the tissue composition of organoids constantly affect each other. Dynamic changes in the shape, cellular composition and transcriptomic profile of these model systems can be used to understand the effect of mutations and treatments in health and disease. In this thesis, I propose new techniques in the field of topological data analysis (TDA) to analyse the gene expression and the morphology of organoids. I use TDA methods, which are inspired by topology, to analyse and quantify the continuous structure of single-cell RNA sequencing data, which is embedded in high dimensional space, and the shape of an organoid. For single-cell RNA sequencing data, I developed the multiscale Laplacian score (MLS) and the UMAP diffusion cover, which both extend and im- prove existing topological analysis methods. I demonstrate the utility of these techniques by applying them to a published benchmark single-cell data set and a data set of mouse colon organoids. The methods validate previously identified genes and detect additional genes with known involvement cancers. To study the morphology of organoids I propose DETECT, a rotationally invariant signature of dynamically changing shapes. I demonstrate the efficacy of this method on a data set of segmented videos of mouse small intestine organoid experiments and show that it outperforms classical shape descriptors. I verify the method on a synthetic organoid data set and illustrate how it generalises to 3D to conclude that DETECT offers rigorous quantification of organoids and opens up computationally scalable methods for distinguishing different growth regimes and assessing treatment effects. Finally, I make a theoretical contribution to the statistical inference of the method underlying DETECT

    Spectral clustering and downsampling-based model selection for functional data

    Get PDF
    Functional data analysis is a growing field of research and has been employed in a wide range of applications ranging from genetics in biology to stock markets in economics. A crucial but challenging problem is clustering of functional data. In this thesis, we review the main contributions in this field and discuss the strengthens and weaknesses of the different clustering functional data approaches. We propose a new framework for clustering functional data and a new paradigm for model selection that is specifically designed for functional data, which are designed to address many of the weaknesses of existing techniques. Our clustering framework is based on first reducing the infinite dimensional space of functional data to a finite dimensional space by smoothing and basis expansion. Then we implement the spectral clustering approach by designing a new distance measure which has the flexibility of using the distance between the original trajectories and/or their derivatives. In addition, we develop a new model selection criterion, by introducing a technique called ‘downsampling’, which allows us to create lower resolution replicates of the observed curves. These replicates can then be used to examine the clustering stability of the existing clustering functional data approaches and select the optimal number of clusters. Further, we combine the two proposed techniques to develop an integrated clustering framework to estimate the number of clusters inherently and accordingly cluster the functional data. An extensive simulation study with existing clustering functional data methods show a superior performance of our clustering framework and reliable results of the proposed model selection criteria. The usefulness of these new approaches is also illustrated through applications to real data

    Advances in knowledge discovery and data mining Part II

    Get PDF
    19th Pacific-Asia Conference, PAKDD 2015, Ho Chi Minh City, Vietnam, May 19-22, 2015, Proceedings, Part II</p

    Uncertainty in Artificial Intelligence: Proceedings of the Thirty-Fourth Conference

    Get PDF
    corecore