
    SMART: Unique splitting-while-merging framework for gene clustering

    Copyright © 2014 Fa et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Successful clustering algorithms are highly dependent on parameter settings. Clustering performance degrades significantly unless the parameters are properly set, and yet it is difficult to set them a priori. To address this issue, in this paper we propose a unique splitting-while-merging clustering framework, named “splitting merging awareness tactics” (SMART), which does not require any a priori knowledge of either the number of clusters or even the possible range of this number. Unlike existing self-splitting algorithms, which over-cluster the dataset into a large number of clusters and then merge some similar clusters, our framework can split and merge clusters automatically during the process and produces the most reliable clustering results by intrinsically integrating many clustering techniques and tasks. The SMART framework is implemented with two distinct clustering paradigms in two algorithms: competitive learning and the finite mixture model. Moreover, within the proposed SMART framework, many other algorithms can be derived for different clustering paradigms. The minimum message length criterion is integrated into the framework for clustering selection. The usefulness of the SMART framework and its algorithms is tested on demonstration datasets and simulated gene expression datasets. Moreover, two real microarray gene expression datasets are studied using this approach. Across many performance metrics, all numerical results show that SMART is superior to the existing self-splitting and traditional algorithms it was compared with.
Three main properties of the proposed SMART framework are summarized as follows: (1) it needs no parameters that depend on the respective dataset or a priori knowledge about the datasets; (2) it is extendible to many different applications; and (3) it offers superior performance compared with counterpart algorithms.
    National Institute for Health Researc
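
    The split-while-merging idea can be illustrated with a toy sketch. The code below is not the authors' implementation: it works on 1-D data only, and the hypothetical thresholds `split_var` and `merge_dist` stand in for the minimum message length criterion that SMART actually uses.

```python
from statistics import mean

def assign(points, centroids):
    """Assign each 1-D point to its nearest centroid."""
    clusters = {i: [] for i in range(len(centroids))}
    for p in points:
        i = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
        clusters[i].append(p)
    return clusters

def split_while_merging(points, split_var=4.0, merge_dist=1.0, rounds=10):
    """Toy SMART-like loop: start from a single cluster, split any cluster
    whose variance exceeds split_var, and merge centroids that end up
    closer than merge_dist to each other."""
    centroids = [mean(points)]
    for _ in range(rounds):
        new = []
        for members in assign(points, centroids).values():
            if not members:
                continue
            mu = mean(members)
            var = mean((p - mu) ** 2 for p in members)
            if len(members) >= 2 and var > split_var:
                # split: replace the centroid with two shifted copies
                new += [mu - var ** 0.5, mu + var ** 0.5]
            else:
                new.append(mu)
        merged = []
        for c in sorted(new):   # merge centroids that drifted too close
            if merged and abs(c - merged[-1]) < merge_dist:
                merged[-1] = (merged[-1] + c) / 2
            else:
                merged.append(c)
        centroids = merged
    return centroids

data = [0.1, 0.2, 0.0, 9.8, 10.1, 10.0, 20.2, 19.9, 20.0]
found = sorted(split_while_merging(data))
```

    With three well-separated groups in the input, the loop grows from one cluster to three without being told the number of clusters in advance.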

    Clustering and its Application in Requirements Engineering

    Large scale software systems challenge almost every activity in the software development life-cycle, including tasks related to eliciting, analyzing, and specifying requirements. Fortunately, many of these complexities can be addressed by clustering the requirements in order to create abstractions that are meaningful to human stakeholders. For example, the requirements elicitation process can be supported by dynamically clustering incoming stakeholders’ requests into themes. Cross-cutting concerns, which have a significant impact on the architectural design, can be identified through the use of fuzzy clustering techniques and metrics designed to detect when a theme cross-cuts the dominant decomposition of the system. Finally, traceability techniques, required in critical software projects by many regulatory bodies, can be automated and enhanced by the use of cluster-based information retrieval methods. Unfortunately, despite a significant body of work describing document clustering techniques, there is almost no prior work that directly addresses the challenges, constraints, and nuances of requirements clustering. As a result, the effectiveness of software engineering tools and processes that depend on requirements clustering is severely limited. This report directly addresses the problem of clustering requirements by surveying standard clustering techniques and discussing their application to the requirements clustering process.
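
    The dynamic theme-clustering of incoming requests mentioned above can be sketched with a simple leader-style incremental algorithm over bag-of-words vectors. Everything here (the 0.4 similarity threshold, the sample requests) is a made-up illustration, not a method from the report:

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def cluster_requests(requests, threshold=0.4):
    """Leader-style incremental clustering: each incoming request joins the
    most similar existing theme, or starts a new theme if none is close."""
    themes = []   # list of (centroid Counter, [member texts])
    for text in requests:
        vec = Counter(text.lower().split())
        best, best_sim = None, 0.0
        for centroid, members in themes:
            sim = cosine(vec, centroid)
            if sim > best_sim:
                best, best_sim = (centroid, members), sim
        if best and best_sim >= threshold:
            best[0].update(vec)       # fold the request into the theme centroid
            best[1].append(text)
        else:
            themes.append((vec, [text]))
    return [members for _, members in themes]

reqs = [
    "user login with password",
    "password reset for user login",
    "export report as pdf",
    "pdf report export formatting",
]
grouped = cluster_requests(reqs)
```

    Because requests are folded in as they arrive, new themes emerge on the fly instead of requiring a fixed cluster count up front.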

    DSMK-means “Density-based Split-and-Merge K-means clustering Algorithm”

    Clustering is widely used to explore and understand large collections of data. The K-means clustering method is one of the most popular approaches due to its simplicity and ease of implementation. This paper introduces the Density-based Split-and-Merge K-means clustering Algorithm (DSMK-means), which is developed to address the stability problems of the standard K-means clustering algorithm, and to improve clustering performance when dealing with datasets that contain clusters with different complex shapes and noise or outliers. Based on a large set of experiments, this paper concludes that the developed algorithm, DSMK-means, is more capable of finding high-accuracy results than other algorithms, especially as it can process datasets containing clusters with different shapes and densities, or those with outliers and noise.
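
    The stability problem that DSMK-means targets is easy to reproduce: plain Lloyd's k-means converges to very different solutions depending on where its centroids start. The sketch below (toy 1-D data, not the DSMK-means algorithm itself) shows a poor initialization getting stuck with much higher inertia than a well-spread one:

```python
from statistics import mean

def kmeans_1d(points, centroids, iters=20):
    """Plain Lloyd's k-means on 1-D data from given starting centroids.
    Returns the sorted final centroids and the total within-cluster inertia."""
    for _ in range(iters):
        buckets = [[] for _ in centroids]
        for p in points:
            i = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            buckets[i].append(p)
        centroids = [mean(b) if b else c for b, c in zip(buckets, centroids)]
    inertia = sum(min(abs(p - c) ** 2 for c in centroids) for p in points)
    return sorted(centroids), inertia

data = [0.0, 0.1, 0.2, 0.9, 1.0, 1.1, 9.0, 9.1]
good, good_cost = kmeans_1d(data, [0.0, 1.0, 9.0])   # well-spread seeds
bad, bad_cost = kmeans_1d(data, [8.9, 9.0, 9.1])     # all seeds in one clump
```

    With all three seeds planted in the rightmost clump, the two left-hand groups collapse into a single cluster and the algorithm never recovers, which is exactly the kind of failure that split-and-merge post-processing is meant to repair.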

    On Data-Independent Properties for Density-Based Dissimilarity Measures in Hybrid Clustering

    Hybrid clustering combines partitional and hierarchical clustering for computational effectiveness and versatility in cluster shape. In such clustering, a dissimilarity measure plays a crucial role in the hierarchical merging. The dissimilarity measure has great impact on the final clustering, and data-independent properties are needed to choose the right dissimilarity measure for the problem at hand. Properties for distance-based dissimilarity measures have been studied for decades, but properties for density-based dissimilarity measures have so far received little attention. Here, we propose six data-independent properties to evaluate density-based dissimilarity measures associated with hybrid clustering, regarding equality, orthogonality, symmetry, outlier and noise observations, and light-tailed models for heavy-tailed clusters. The significance of the properties is investigated, and we study some well-known dissimilarity measures based on Shannon entropy, misclassification rate, Bhattacharyya distance and Kullback-Leibler divergence with respect to the proposed properties. As none of them satisfies all the proposed properties, we introduce a new dissimilarity measure based on the Kullback-Leibler information and show that it satisfies all proposed properties. The effect of the proposed properties is also illustrated on several real and simulated data sets.
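
    As a concrete example of a density-based dissimilarity, the symmetrised Kullback-Leibler divergence between two fitted Gaussians has a closed form in one dimension. This is the standard textbook formula, not the new measure proposed in the paper; note that it already satisfies the equality (zero for identical clusters) and symmetry properties discussed above:

```python
import math

def kl_gauss(mu_p, var_p, mu_q, var_q):
    """KL(p || q) for univariate Gaussians:
    log(sigma_q / sigma_p) + (var_p + (mu_p - mu_q)^2) / (2 var_q) - 1/2."""
    return (math.log(math.sqrt(var_q / var_p))
            + (var_p + (mu_p - mu_q) ** 2) / (2 * var_q) - 0.5)

def sym_kl(mu_p, var_p, mu_q, var_q):
    """Symmetrised KL divergence, usable as a merge dissimilarity."""
    return kl_gauss(mu_p, var_p, mu_q, var_q) + kl_gauss(mu_q, var_q, mu_p, var_p)

same = sym_kl(0.0, 1.0, 0.0, 1.0)   # identical cluster models -> 0
far  = sym_kl(0.0, 1.0, 5.0, 1.0)   # well-separated cluster models
```

    In hybrid clustering, each cluster from the partitional stage is summarised by a fitted density, and the hierarchical stage merges the pair whose densities are least dissimilar under such a measure.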

    Mineral Mapping on Hyperspectral Imageries Using Cohesion-based Self Merging Algorithm

    Recently, hybrid clustering algorithms have gained much research attention because they produce better clustering results and are computationally efficient. Hyperspectral image classification studies should be no exception, including mineral mapping. This study aims to tackle the biggest challenge of mapping the mineralogy of drill core samples, which consumes a lot of time. In this paper, we present an investigation using a hybrid clustering algorithm, cohesion-based self-merging (CSM), for mineral mapping to determine the number and location of the minerals that formed the rock. The CSM clustering performance was then compared to its classical counterpart, K-means plus-plus (K-means++). We conducted experiments using hyperspectral images from multiple rock samples to understand how well the clustering algorithm segmented the minerals that exist in the rock. The samples in this study contain minerals with identical absorption features in certain locations, which increases the complexity. The elbow method and silhouette analysis did not perform well in deciding the optimum cluster size due to the slight variance and high dimensionality of the datasets. Thus, iterations over various numbers of k-clusters and m-subclusters of each rock were performed to get the mineral clusters. Both algorithms were able to distinguish slight variations in the absorption features of any mineral. The spectral variation within a single mineral found by our algorithm might be studied further to understand any possible unidentified group of clusters. The spatial consideration of the CSM algorithm induced several misclassified pixels. Hence, the mineral maps produced in this study are not expected to precisely match the ground truth.
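
    The silhouette analysis mentioned above scores a candidate clustering by comparing each point's mean intra-cluster distance a with its mean distance b to the nearest other cluster, s = (b - a) / max(a, b). A minimal 1-D version follows (illustrative only; the study applies the analysis to high-dimensional spectra, where it struggled):

```python
def mean_dist(p, group):
    """Mean absolute distance from point p to a group of 1-D points."""
    return sum(abs(p - q) for q in group) / len(group)

def silhouette(points, labels):
    """Mean silhouette coefficient over all points (1-D toy version)."""
    clusters = {}
    for i, l in enumerate(labels):
        clusters.setdefault(l, []).append(points[i])
    scores = []
    for i, l in enumerate(labels):
        p = points[i]
        others_same = [q for j, q in enumerate(points) if labels[j] == l and j != i]
        if not others_same:          # singleton cluster: defined as 0
            scores.append(0.0)
            continue
        a = mean_dist(p, others_same)
        b = min(mean_dist(p, clusters[m]) for m in clusters if m != l)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

pts = [0.0, 0.1, 0.2, 10.0, 10.1, 10.2]
good = silhouette(pts, [0, 0, 0, 1, 1, 1])   # clean split -> near 1
bad  = silhouette(pts, [0, 0, 1, 1, 0, 1])   # scrambled labels -> near 0
```

    In the study's setting, the score curve over k was too flat to pick a clear optimum, which is why the authors fell back on iterating over candidate k and m values.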

    Detecting threatening insiders with lightweight media forensics

    This research uses machine learning and outlier analysis to detect potentially hostile insiders through the automated analysis of stored data on cell phones, laptops, and desktop computers belonging to members of an organization. Whereas other systems look for specific signatures associated with hostile insider activity, our system is based on the creation of a “storage profile” for each user and then an automated analysis of all the storage profiles in the organization, with the purpose of finding storage outliers. Our hypothesis is that malicious insiders will have specific data and concentrations of data that differ from their colleagues and coworkers. By exploiting these differences, we can identify potentially hostile insiders. Our system is based on a combination of existing open source computer forensic tools and data mining algorithms. We modify these tools to perform a “lightweight” analysis based on statistical sampling over time. In this way, our approach is both efficient and privacy-sensitive. As a result, we can detect not just individuals that differ from their co-workers, but also insiders that differ from their historic norms. Accordingly, we should be able to detect insiders that have been “turned” by events or outside organizations. We should also be able to detect insider accounts that have been taken over by outsiders. Our project, now in its first year, is a three-year project funded by the Department of Homeland Security, Science and Technology Directorate, Cyber Security Division. In this paper we describe the underlying approach and demonstrate how the storage profile is created and collected using specially modified open source tools. We also present the results of running these tools on a 500GB corpus of simulated insider threat data created by the Naval Postgraduate School in 2008 under a grant from the National Science Foundation.
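
    The storage-profile outlier analysis can be illustrated with per-feature z-scores across users. The feature names, numbers, and 1.8 threshold below are entirely hypothetical; the actual system derives its profiles from modified forensic tools and statistical sampling:

```python
from statistics import mean, pstdev

def flag_outliers(profiles, threshold=1.8):
    """Flag users whose storage profile deviates from the organisation:
    profiles maps user -> feature vector (e.g. counts of file types);
    a user is flagged if any feature's z-score exceeds the threshold."""
    users = list(profiles)
    n_feats = len(next(iter(profiles.values())))
    flagged = set()
    for f in range(n_feats):
        col = [profiles[u][f] for u in users]
        mu, sd = mean(col), pstdev(col)
        if sd == 0:                      # feature identical for everyone
            continue
        for u in users:
            if abs(profiles[u][f] - mu) / sd > threshold:
                flagged.add(u)
    return flagged

# hypothetical feature vectors: (doc files, image files, encrypted blobs)
profiles = {
    "alice":   (120, 300, 2),
    "bob":     (130, 280, 1),
    "carol":   (110, 310, 3),
    "dave":    (125, 290, 2),
    "mallory": (115, 295, 60),   # unusual concentration of encrypted data
}
suspects = flag_outliers(profiles)
```

    Comparing each user's current profile against their own history, rather than against coworkers, would extend the same mechanism to the "turned insider" case described above.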

    Clustering analysis for gene expression data: a methodological review

    Clustering is one of the most useful tools for microarray gene expression data analysis. Although there have been many reviews and surveys in the literature, many good and effective clustering ideas have not been collected in a systematic way. In this paper, we review five clustering families representing five clustering concepts rather than five algorithms. We also review some clustering validations and collect a list of benchmark gene expression datasets.

    Development of a R package to facilitate the learning of clustering techniques

    This project explores the development of a tool, in the form of an R package, to ease the process of learning clustering techniques: how they work and what their pros and cons are. This tool should provide implementations of several different clustering techniques, with explanations, in order to allow the student to become familiar with the characteristics of each algorithm by testing them against several different datasets while deepening their understanding of them through the explanations. Additionally, these explanations should adapt to the input data, making the tool suited not only to self-regulated learning but to teaching too.
    Grado en IngenierĂ­a InformĂĄtic

    An exploration of methodologies to improve semi-supervised hierarchical clustering with knowledge-based constraints

    Clustering algorithms with constraints (also known as semi-supervised clustering algorithms) have been introduced to the field of machine learning as a significant variant of the conventional unsupervised clustering algorithms. They have been demonstrated to achieve better performance because they integrate prior knowledge during the clustering process, which enables uncovering relevant useful information from the data being clustered. However, research on developing semi-supervised hierarchical clustering techniques is still an open and active area of investigation. The majority of current semi-supervised clustering algorithms are developed as partitional clustering (PC) methods, and only a few research efforts have been made on developing semi-supervised hierarchical clustering methods. The aim of this research is to enhance hierarchical clustering (HC) algorithms based on prior knowledge, by adopting novel methodologies. [Continues.
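
    One common way to inject prior knowledge into hierarchical clustering is through cannot-link constraints: the agglomerative loop simply refuses any merge that would place a constrained pair in the same cluster. A toy single-link version follows (an illustration of the general idea, not the methodology developed in this thesis):

```python
def constrained_agglomerative(points, k, cannot_link):
    """Single-link agglomerative clustering on 1-D points that skips any
    merge joining a cannot-link pair (pairs of indices into points)."""
    clusters = [{i} for i in range(len(points))]

    def dist(a, b):   # single-link distance between two clusters
        return min(abs(points[i] - points[j]) for i in a for j in b)

    def violates(a, b):
        return any((i in a and j in b) or (j in a and i in b)
                   for i, j in cannot_link)

    while len(clusters) > k:
        pairs = sorted(
            (dist(a, b), x, y)
            for x, a in enumerate(clusters)
            for y, b in enumerate(clusters) if x < y
        )
        for _, x, y in pairs:            # closest merge that is allowed
            if not violates(clusters[x], clusters[y]):
                clusters[x] |= clusters[y]
                del clusters[y]
                break
        else:
            break   # every remaining merge violates a constraint
    return clusters

pts = [0.0, 0.1, 0.2, 5.0, 5.1]
# without constraints, points 1 and 2 would merge early; the
# cannot-link pair forces them into different final clusters
out = constrained_agglomerative(pts, 2, cannot_link=[(1, 2)])
```

    Must-link constraints can be handled symmetrically by merging constrained pairs before the loop starts.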

    Using Hierarchical Clustering for Learning the Ontologies used in Recommendation Systems

    Ontologies are being successfully used to overcome semantic heterogeneity, and are becoming fundamental elements of the Semantic Web. Recently, it has also been shown that ontologies can be used to build more accurate and more personalized recommendation systems by inferring missing user preferences. However, these systems assume the existence of ontologies, without considering their construction. With product catalogs changing continuously, new techniques are required to build these ontologies in real time, autonomously and without expert intervention. This paper focuses on this problem and shows that it is possible to learn ontologies autonomously by using clustering algorithms. Results on the MovieLens and Jester data sets show that recommender systems with learnt ontologies significantly outperform the classical recommendation approach.
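
    The core idea, learning a concept hierarchy from item similarities, can be sketched as greedy agglomerative merging into a binary tree. The movie/genre data and Jaccard similarity below are hypothetical stand-ins for the paper's MovieLens setup:

```python
def build_taxonomy(items, sim):
    """Greedy agglomerative construction of a binary concept hierarchy:
    repeatedly merge the most similar pair of nodes into a nested tuple.
    sim scores two leaves; group similarity is the max over member pairs."""
    def leaves(node):
        return [node] if isinstance(node, str) else leaves(node[0]) + leaves(node[1])
    nodes = list(items)
    while len(nodes) > 1:
        x, y = max(
            ((a, b) for a in range(len(nodes)) for b in range(a + 1, len(nodes))),
            key=lambda p: max(sim(u, v)
                              for u in leaves(nodes[p[0]])
                              for v in leaves(nodes[p[1]])),
        )
        merged = (nodes[x], nodes[y])
        nodes = [n for i, n in enumerate(nodes) if i not in (x, y)] + [merged]
    return nodes[0]

# hypothetical genre-overlap similarity between movies
genres = {
    "Alien": {"sci-fi", "horror"},
    "Blade Runner": {"sci-fi", "noir"},
    "Airplane!": {"comedy"},
    "Spaceballs": {"comedy", "sci-fi"},
}
jaccard = lambda a, b: len(genres[a] & genres[b]) / len(genres[a] | genres[b])
tree = build_taxonomy(list(genres), jaccard)
```

    Each internal node of the resulting tree plays the role of an ontology concept grouping the items beneath it, which is the structure a recommender can then use to propagate preferences.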