4 research outputs found

    Faster k-Medoids Clustering: Improving the PAM, CLARA, and CLARANS Algorithms

    Clustering non-Euclidean data is difficult, and one of the most widely used algorithms besides hierarchical clustering is Partitioning Around Medoids (PAM), also simply referred to as k-medoids. In Euclidean geometry the mean, as used in k-means, is a good estimator for the cluster center, but this does not hold for arbitrary dissimilarities. PAM uses the medoid instead: the object with the smallest total dissimilarity to all others in the cluster. This notion of centrality can be used with any (dis-)similarity, and is thus highly relevant to many domains, such as biology, that require the use of Jaccard, Gower, or more complex distances. A key issue with PAM is its high run time cost. We propose modifications to the PAM algorithm that achieve an O(k)-fold speedup in the second SWAP phase of the algorithm while still finding the same results as the original PAM algorithm. If we slightly relax the choice of swaps performed (at comparable quality), we can further accelerate the algorithm by performing up to k swaps in each iteration. With the substantially faster SWAP, we can now also explore alternative strategies for choosing the initial medoids. We also show how the CLARA and CLARANS algorithms benefit from these modifications. Our approach can easily be combined with earlier approaches to using PAM and CLARA on big data (some of which use PAM as a subroutine, and hence immediately benefit from these improvements), where performance at high k becomes increasingly important. In experiments on real data with k=100, we observed a 200-fold speedup compared to the original PAM SWAP algorithm, making PAM applicable to larger data sets as long as we can afford to compute a distance matrix, and in particular to higher k (at k=2, the new SWAP was only 1.5 times faster; the speedup is expected to increase with k).
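    A minimal sketch of the two ingredients the abstract builds on, assuming a precomputed dissimilarity matrix dist as a NumPy array; the function names are illustrative, and this exhaustive SWAP is the slow textbook baseline, not the paper's accelerated variant:

        import numpy as np

        def medoid(dist, members):
            # The medoid is the cluster member with the smallest total
            # dissimilarity to all other members of the cluster.
            sub = dist[np.ix_(members, members)]
            return members[int(np.argmin(sub.sum(axis=1)))]

        def total_deviation(dist, medoids):
            # Sum of each point's dissimilarity to its nearest medoid.
            return dist[:, medoids].min(axis=1).sum()

        def naive_swap_step(dist, medoids):
            # Evaluate all k*(n-k) candidate swaps and keep the best one.
            # Each evaluation scans all points; removing this redundant
            # work is what the proposed faster SWAP variants achieve.
            n = dist.shape[0]
            best_td, best = total_deviation(dist, medoids), None
            for i in range(len(medoids)):
                for x in range(n):
                    if x in medoids:
                        continue
                    cand = list(medoids)
                    cand[i] = x
                    td = total_deviation(dist, cand)
                    if td < best_td:
                        best_td, best = td, cand
            return best, best_td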

    Implicit data analyses in the production of knowledge in computer science: a bibliometric study

    Undergraduate thesis (Trabalho de Conclusão de Curso), presented to obtain the degree of Bachelor in Computer Science at the Universidade do Extremo Sul Catarinense, UNESC. Science and business are examples of fields affected by the remarkable volume and variety of data available today. This has brought one field of study into prominence: data science. The great challenge is to analyze this quantity of data and turn it into information, which requires appropriate techniques, the so-called implicit analyses. Given the importance of these algorithms in everyday life, the scientific output grounded in this field is also growing. This work therefore turns to bibliometrics, a field of study within information science that evaluates scientific output quantitatively and statistically, and aims to conduct a bibliometric survey in computer science based on works that employ implicit analysis techniques. In addition to the bibliometric mapping, a theoretical foundation on data science, implicit analyses, and bibliometrics was also laid out. The following implicit analyses are covered: Apriori, decision trees, Bayesian classifiers, DBSCAN, FP-Growth, support vector machines, artificial neural networks, k-means, and k-medoids. The scientific articles analyzed come from three databases: SciElo, Scopus, and Web of Science. The survey applied the following inclusion criteria: articles applied to computing, using one of the implicit analyses, and not themselves being bibliometric studies. The bibliometric survey ended with a corpus of 46 articles, from which results and conclusions relevant to the landscape of implicit analysis research in computer science were obtained. By h-index, the three leading authors are Brazdil Thomaš, Artur S. D'Avila Garcez, and Mahajan, Meena, with h-indices of fifteen, thirteen, and twelve respectively; the researcher Ye, Yongkai stands out as the only author with more than one work in this survey and also establishes co-authorship relations in other works. Furthermore, 2018 was the most productive year, with sixteen articles, and China and India stand out for their productivity, with nine and seven articles respectively. From the articles, five research groups also stand out: Research and Development, Natural Language Processing, Computer Security, Content Search and Indexing, and Missing Data in datasets. The most used analyses were decision trees, Apriori, and artificial neural networks. Based on the results obtained, we conclude that this research field is growing, that it has at least two trending research subareas, Computational Research and Development and Natural Language Processing, and that it presents one research gap, Missing Data in datasets. Among the authors, a cooperation relationship is confirmed, as identified in the works of Ye, Yongkai, and the studies again point to the most used analyses: decision trees, Apriori, and artificial neural networks.
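    As a small illustration of the metric behind the author ranking above, a sketch of the h-index (an author has index h if h of their papers each have at least h citations); the example citation counts are made up:

        def h_index(citations):
            # Sort citation counts in decreasing order; h is the largest
            # rank i such that the i-th paper has at least i citations.
            h = 0
            for i, c in enumerate(sorted(citations, reverse=True), start=1):
                if c >= i:
                    h = i
                else:
                    break
            return h

        h_index([10, 8, 5, 4, 3])  # == 4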

    The relationship of DBSCAN to matrix factorization and spectral clustering

    DBSCAN is a popular approach for density-based clustering. In this short "work in progress" paper, we want to present an interpretation of DBSCAN as a matrix factorization problem, which introduces a theoretical connection (but not an equivalence) between DBSCAN and Spectral Clustering (SC). While this does not yield a faster algorithm for DBSCAN, establishing this relationship is a step towards a more unified view of clustering, by identifying further relationships between some of the most popular clustering algorithms.
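    Since the abstract gives no construction, the following is only a minimal sketch, assuming scikit-learn, that runs the two related algorithms side by side on the same non-convex data; it does not reproduce the matrix-factorization interpretation itself:

        from sklearn.cluster import DBSCAN, SpectralClustering
        from sklearn.datasets import make_moons

        # Two half-moon clusters: a standard example where both density-based
        # and spectral methods recover the shapes while k-means fails.
        X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

        db_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)
        sc_labels = SpectralClustering(n_clusters=2,
                                       affinity="nearest_neighbors",
                                       random_state=0).fit_predict(X)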

    A mathematical theory of making hard decisions: model selection and robustness of matrix factorization with binary constraints

    Get PDF
    One of the first and most fundamental tasks in machine learning is to group observations within a dataset. Given a notion of similarity, finding those instances which are outstandingly similar to each other has manifold applications; recommender systems and topic analysis in text data are among the most intuitive examples. The interpretation of the groups, called clusters, is facilitated if the assignment of samples is definite. Especially in high-dimensional data, denoting a degree to which an observation belongs to a specified cluster requires subsequent processing of the model to filter the most important information. We argue that a good summary of the data provides hard decisions on the following question: how many groups are there, and which observations belong to which clusters? In this work, we contribute to the theoretical and practical background of clustering tasks, addressing one or both aspects of this question. Our overview of state-of-the-art clustering approaches details the challenges of our ambition to provide hard decisions. Based on this overview, we develop new methodologies for two branches of clustering: one concerns the derivation of nonconvex clusters, known as spectral clustering; the other addresses the identification of biclusters, a set of samples together with the similarity-defining features, via Boolean matrix factorization. One of the main challenges in both settings is robustness to noise. Assuming that the issue of robustness is controllable by means of theoretical insights, we take a closer look at those aspects of established clustering methods which lack a theoretical foundation. In the scope of Boolean matrix factorization, we propose a versatile framework for the optimization of matrix factorizations subject to binary constraints. Boolean factorizations in particular have so far been computed by intuitive methods, implementing greedy heuristics which lack quality guarantees for the obtained solutions. In contrast, we propose to build upon recent advances in nonconvex optimization theory. This enables us to provide convergence guarantees to local optima of a relaxed objective, requiring only approximately binary factor matrices. By means of this new optimization scheme, PAL-Tiling, we propose two approaches to automatically determine the number of clusters: one is based on information theory, employing the minimum description length principle; the other is a novel statistical approach, controlling the false discovery rate. The flexibility of our framework PAL-Tiling enables the optimization of novel factorization schemes. In a different context, where every data point belongs to a pre-defined class, a characterization of the classes may be obtained by Boolean factorizations. However, there are cases where this traditional factorization scheme is not sufficient. Therefore, we propose the integration of another factor matrix, reflecting class-specific differences within a cluster. Our theoretical considerations are complemented by empirical evaluations, showing how our methods combine theoretical soundness with practical advantages.
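    A minimal sketch of the Boolean factorization objective the thesis optimizes, with illustrative names (the relaxed PAL-Tiling optimization itself is not reproduced): binary factors U and V reconstruct X under the Boolean product, and the error counts mismatching entries:

        import numpy as np

        def boolean_product(U, V):
            # (U o V)[i, j] = 1 iff some k has U[i, k] = V[k, j] = 1;
            # overlapping clusters OR together instead of adding up.
            return (U @ V > 0).astype(int)

        def reconstruction_error(X, U, V):
            # Number of entries where the Boolean reconstruction differs from X.
            return int(np.sum(X != boolean_product(U, V)))

        # Two overlapping biclusters ("tiles") in a 4 x 5 binary matrix:
        U = np.array([[1, 0], [1, 1], [0, 1], [0, 1]])
        V = np.array([[1, 1, 1, 0, 0], [0, 0, 1, 1, 1]])
        X = boolean_product(U, V)
        assert reconstruction_error(X, U, V) == 0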