119 research outputs found
SMART: Unique splitting-while-merging framework for gene clustering
Copyright © 2014 Fa et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Successful clustering algorithms are highly dependent on parameter settings. The clustering performance degrades significantly unless parameters are properly set, and yet it is difficult to set these parameters a priori. To address this issue, in this paper we propose a unique splitting-while-merging clustering framework, named "splitting merging awareness tactics" (SMART), which does not require any a priori knowledge of either the number of clusters or even the possible range of this number. Unlike existing self-splitting algorithms, which over-cluster the dataset into a large number of clusters and then merge some similar clusters, our framework has the ability to split and merge clusters automatically during the process, and produces the most reliable clustering results by intrinsically integrating many clustering techniques and tasks. The SMART framework is implemented with two distinct clustering paradigms in two algorithms: competitive learning and the finite mixture model. Nevertheless, within the proposed SMART framework, many other algorithms can be derived for different clustering paradigms. The minimum message length algorithm is integrated into the framework as the clustering selection criterion. The usefulness of the SMART framework and its algorithms is tested on demonstration datasets and simulated gene expression datasets. Moreover, two real microarray gene expression datasets are studied using this approach. Based on many performance metrics, all numerical results show that SMART is superior to existing self-splitting algorithms and traditional algorithms.
Three main properties of the proposed SMART framework are summarized as follows: (1) it needs no parameters dependent on the respective dataset or a priori knowledge about the datasets; (2) it is extendible to many different applications; (3) it offers superior performance compared with counterpart algorithms.
National Institute for Health Research
Clustering and its Application in Requirements Engineering
Large-scale software systems challenge almost every activity in the software development life-cycle, including tasks related to eliciting, analyzing, and specifying requirements. Fortunately, many of these complexities can be addressed by clustering the requirements in order to create abstractions that are meaningful to human stakeholders. For example, the requirements elicitation process can be supported by dynamically clustering incoming stakeholders' requests into themes. Cross-cutting concerns, which have a significant impact on the architectural design, can be identified through the use of fuzzy clustering techniques and metrics designed to detect when a theme cross-cuts the dominant decomposition of the system. Finally, traceability techniques, required in critical software projects by many regulatory bodies, can be automated and enhanced by the use of cluster-based information retrieval methods. Unfortunately, despite a significant body of work describing document clustering techniques, there is almost no prior work which directly addresses the challenges, constraints, and nuances of requirements clustering. As a result, the effectiveness of software engineering tools and processes that depend on requirements clustering is severely limited. This report directly addresses the problem of clustering requirements by surveying standard clustering techniques and discussing their application to the requirements clustering process.
DSMK-means: "Density-based Split-and-Merge K-means Clustering Algorithm"
Clustering is widely used to explore and understand large collections of data. The K-means clustering method is one of the most popular approaches due to its ease of use and simplicity of implementation. This paper introduces the Density-based Split-and-Merge K-means clustering Algorithm (DSMK-means), which is developed to address the stability problems of the standard K-means clustering algorithm and to improve clustering performance on datasets that contain clusters with different complex shapes, noise, or outliers. Based on a large set of experiments, this paper concludes that the developed algorithm, DSMK-means, is more capable of finding high-accuracy results than other algorithms, especially as it can process datasets containing clusters with different shapes or densities, or those with outliers and noise.
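The split-and-merge refinement this abstract alludes to can be sketched generically. Note this is an illustration of the general idea, not the authors' DSMK-means: it over-clusters with plain k-means (a simple deterministic initialisation is used here for reproducibility; k-means++ would be better in practice) and then merges clusters whose centroids fall within a distance tolerance, whereas the real algorithm uses density information in both phases.

```python
import numpy as np

def kmeans(X, k, iters=50):
    """Plain Lloyd's k-means with a deterministic initialisation
    (evenly spaced samples from X)."""
    centers = X[np.linspace(0, len(X) - 1, k).astype(int)].copy()
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers, labels

def merge_close_clusters(centers, labels, tol):
    """Greedily merge cluster pairs whose centroids lie closer than tol.
    The merged centroid is a simple average -- a sketch, not a weighted mean."""
    centers = list(centers)
    merged = True
    while merged and len(centers) > 1:
        merged = False
        for i in range(len(centers)):
            for j in range(i + 1, len(centers)):
                if np.linalg.norm(centers[i] - centers[j]) < tol:
                    labels[labels == j] = i      # relabel j's points to i
                    labels[labels > j] -= 1      # close the gap in label ids
                    centers[i] = (centers[i] + centers.pop(j)) / 2
                    merged = True
                    break
            if merged:
                break
    return np.array(centers), labels
```

Over-clustering two well-separated blobs with k = 4 and then merging with a suitable tolerance recovers the two true clusters.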
On Data-Independent Properties for Density-Based Dissimilarity Measures in Hybrid Clustering
Hybrid clustering combines partitional and hierarchical clustering for computational effectiveness and versatility in cluster shape. In such clustering, a dissimilarity measure plays a crucial role in the hierarchical merging. The dissimilarity measure has great impact on the final clustering, and data-independent properties are needed to choose the right dissimilarity measure for the problem at hand. Properties for distance-based dissimilarity measures have been studied for decades, but properties for density-based dissimilarity measures have so far received little attention. Here, we propose six data-independent properties to evaluate density-based dissimilarity measures associated with hybrid clustering, regarding equality, orthogonality, symmetry, outlier and noise observations, and light-tailed models for heavy-tailed clusters. The significance of the properties is investigated, and we study some well-known dissimilarity measures based on Shannon entropy, misclassification rate, Bhattacharyya distance and Kullback-Leibler divergence with respect to the proposed properties. As none of them satisfy all the proposed properties, we introduce a new dissimilarity measure based on the Kullback-Leibler information and show that it satisfies all proposed properties. The effect of the proposed properties is also illustrated on several real and simulated data sets.
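As a concrete example of a density-based dissimilarity, the symmetrised Kullback-Leibler divergence between Gaussians fitted to two clusters has a closed form. The sketch below is the standard textbook construction, not the new measure proposed in the paper:

```python
import numpy as np

def gaussian_kl(mu0, cov0, mu1, cov1):
    """KL(N0 || N1) for multivariate Gaussians, in nats (closed form)."""
    d = len(mu0)
    inv1 = np.linalg.inv(cov1)
    diff = mu1 - mu0
    return 0.5 * (np.trace(inv1 @ cov0)          # scale mismatch
                  + diff @ inv1 @ diff           # Mahalanobis distance of means
                  - d
                  + np.log(np.linalg.det(cov1) / np.linalg.det(cov0)))

def cluster_dissimilarity(A, B):
    """Symmetrised KL between Gaussians fitted to two clusters (rows = points)."""
    muA, covA = A.mean(0), np.cov(A.T)
    muB, covB = B.mean(0), np.cov(B.T)
    return 0.5 * (gaussian_kl(muA, covA, muB, covB)
                  + gaussian_kl(muB, covB, muA, covA))
```

The measure is zero for identical clusters and grows as the fitted densities separate, which is the behaviour the paper's data-independent properties are designed to probe.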
Mineral Mapping on Hyperspectral Imageries Using Cohesion-based Self Merging Algorithm
Recently, hybrid clustering algorithms have gained much research attention because they produce better clustering results and are computationally efficient. Hyperspectral image classification studies are no exception, including mineral mapping. This study aims to tackle the biggest challenge of mapping the mineralogy of drill core samples, which consumes a lot of time. In this paper, we present an investigation using a hybrid clustering algorithm, cohesion-based self-merging (CSM), for mineral mapping to determine the number and location of the minerals that formed the rock. The CSM clustering performance was then compared to a classical counterpart, K-means plus-plus (K-means++). We conducted experiments using hyperspectral images from multiple rock samples to understand how well the clustering algorithm segmented the minerals present in the rock. The samples in this study contain minerals with identical absorption features in certain locations, which increases the complexity. The elbow method and silhouette analysis did not perform well in deciding the optimum cluster size due to the slight variance and high dimensionality of the datasets. Thus, iterations over various numbers of k-clusters and m-subclusters of each rock were performed to obtain the mineral clusters. Both algorithms were able to distinguish slight variations in the absorption features of any mineral. The spectral variation within a single mineral found by our algorithm might be studied further to understand any possible unidentified group of clusters. The spatial consideration of the CSM algorithm induced several misclassified pixels; hence, the mineral maps produced in this study are not expected to be precisely similar to ground truths.
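For reference, the silhouette analysis mentioned above scores each point by comparing its mean intra-cluster distance a(i) with the mean distance b(i) to the nearest other cluster, s(i) = (b(i) - a(i)) / max(a(i), b(i)). A rough self-contained version (in practice scikit-learn's `silhouette_score` is the usual tool) might look like:

```python
import numpy as np

def silhouette(X, labels):
    """Mean silhouette coefficient over all points, from pairwise distances."""
    D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    scores = []
    for i in range(len(X)):
        same = (labels == labels[i])
        same[i] = False
        if not same.any():          # singleton cluster: score 0 by convention
            scores.append(0.0)
            continue
        a = D[i, same].mean()       # mean intra-cluster distance
        b = min(D[i, labels == c].mean()     # nearest other cluster
                for c in set(labels.tolist()) if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))
```

Values near 1 indicate compact, well-separated clusters; values near or below 0 indicate the kind of ambiguous partitions that, as the abstract reports, make the criterion unreliable for choosing k on low-variance, high-dimensional spectra.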
Detecting threatening insiders with lightweight media forensics
This research uses machine learning and outlier analysis to detect potentially hostile insiders through the automated analysis of data stored on cell phones, laptops, and desktop computers belonging to members of an organization. Whereas other systems look for specific signatures associated with hostile insider activity, our system is based on the creation of a "storage profile" for each user, followed by an automated analysis of all the storage profiles in the organization, with the purpose of finding storage outliers. Our hypothesis is that malicious insiders will have specific data, and concentrations of data, that differ from those of their colleagues and coworkers. By exploiting these differences, we can identify potentially hostile insiders.

Our system is based on a combination of existing open-source computer forensic tools and data-mining algorithms. We modify these tools to perform a "lightweight" analysis based on statistical sampling over time, which makes our approach both efficient and privacy-sensitive. As a result, we can detect not just individuals that differ from their co-workers, but also insiders that differ from their own historic norms. Accordingly, we should be able to detect insiders that have been "turned" by events or outside organizations, as well as insider accounts that have been taken over by outsiders.

Our project, now in its first year, is a three-year effort funded by the Department of Homeland Security, Science and Technology Directorate, Cyber Security Division. In this paper we describe the underlying approach and demonstrate how the storage profile is created and collected using specially modified open-source tools. We also present the results of running these tools on a 500 GB corpus of simulated insider-threat data created by the Naval Postgraduate School in 2008 under a grant from the National Science Foundation.
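A minimal sketch of the kind of outlier flagging described above, assuming each user's storage profile has already been reduced to a numeric feature vector (the feature choice here is hypothetical, e.g. file counts by type or an encrypted-file ratio). The median/MAD "modified z-score" is one standard robust choice; the project's actual statistics are not specified in this abstract:

```python
import numpy as np

def flag_outliers(profiles, threshold=3.5):
    """Return indices of rows whose modified z-score (median/MAD based)
    exceeds `threshold` in any feature.  Robust statistics keep an
    extreme profile from inflating the scale estimate it is judged by."""
    profiles = np.asarray(profiles, float)
    median = np.median(profiles, axis=0)
    mad = np.median(np.abs(profiles - median), axis=0)
    mad = np.where(mad == 0, 1e-9, mad)          # avoid division by zero
    z = 0.6745 * (profiles - median) / mad       # Iglewicz-Hoaglin modified z-score
    return np.where((np.abs(z) > threshold).any(axis=1))[0]
```

Applied organization-wide this flags users who differ from their peers; applied to one user's profiles over time it would flag departures from their own historic norm.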
Clustering analysis for gene expression data: a methodological review
Clustering is one of the most useful tools for microarray gene expression data analysis. Although there have been many reviews and surveys in the literature, many good and effective clustering ideas have not been collected in a systematic way, for various reasons. In this paper, we review five clustering families representing five clustering concepts rather than five algorithms. We also review some clustering validation methods and collect a list of benchmark gene expression datasets.
Development of an R package to facilitate the learning of clustering techniques
This project explores the development of a tool, in the form of an R package, to ease the process of learning clustering techniques, how they work, and what their pros and cons are. This tool should provide implementations of several different clustering techniques, with explanations, in order to allow the student to become familiar with the characteristics of each algorithm by testing them against several different datasets while deepening their understanding through the explanations. Additionally, these explanations should adapt to the input data, making the tool suitable not only for self-regulated learning but for teaching too.
Grado en Ingeniería Informática
An exploration of methodologies to improve semi-supervised hierarchical clustering with knowledge-based constraints
Clustering algorithms with constraints (also known as semi-supervised clustering algorithms) have been introduced to the field of machine learning as a significant variant of conventional unsupervised clustering algorithms. They have been demonstrated to achieve better performance by integrating prior knowledge during the clustering process, which enables them to uncover relevant and useful information from the data being clustered. However, the research conducted on developing semi-supervised hierarchical clustering techniques is still an open and active area of investigation. The majority of current semi-supervised clustering algorithms are developed as partitional clustering (PC) methods, and only a few research efforts have been made to develop semi-supervised hierarchical clustering methods. The aim of this research is to enhance hierarchical clustering (HC) algorithms based on prior knowledge by adopting novel methodologies. [Continues.
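One common way to inject such prior knowledge into hierarchical clustering is through must-link / cannot-link pairs that gate the merge step. The sketch below is a generic illustration of that idea (single linkage, cannot-link constraints only; must-link pairs could be handled by pre-merging them), not one of the methodologies developed in this thesis:

```python
import numpy as np

def constrained_single_linkage(X, k, cannot_link=()):
    """Agglomerate rows of X down to k clusters by single linkage,
    skipping any merge that would put a cannot-link pair together."""
    clusters = [{i} for i in range(len(X))]
    D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    while len(clusters) > k:
        best, pair = np.inf, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                merged = clusters[a] | clusters[b]
                if any(i in merged and j in merged for i, j in cannot_link):
                    continue                      # merge would violate a constraint
                d = min(D[i, j] for i in clusters[a] for j in clusters[b])
                if d < best:
                    best, pair = d, (a, b)
        if pair is None:                          # no admissible merge remains
            break
        a, b = pair
        clusters[a] |= clusters.pop(b)
    return clusters
```

With a cannot-link constraint on the two nearest points, the agglomeration is forced to a different, constraint-respecting dendrogram than the unconstrained run would produce.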
Using Hierarchical Clustering for Learning the Ontologies used in Recommendation Systems
Ontologies are being successfully used to overcome semantic heterogeneity, and are becoming fundamental elements of the Semantic Web. Recently, it has also been shown that ontologies can be used to build more accurate and more personalized recommendation systems by inferring missing user preferences. However, these systems assume the existence of ontologies, without considering their construction. With product catalogs changing continuously, new techniques are required to build these ontologies in real time, autonomously from any expert intervention. This paper focuses on this problem and shows that it is possible to learn ontologies autonomously by using clustering algorithms. Results on the MovieLens and Jester data sets show that recommender systems with learnt ontologies significantly outperform the classical recommendation approach.
- …