    Clustering Algorithms: Their Application to Gene Expression Data

    Gene expression data hide vital information required to understand the biological processes that take place in a particular organism in relation to its environment. Deciphering the hidden patterns in gene expression data offers a tremendous opportunity to strengthen the understanding of functional genomics. The complexity of biological networks and the number of genes involved make it harder to comprehend and interpret the resulting mass of data, which consists of millions of measurements; these data also exhibit vagueness, imprecision, and noise. Clustering is therefore a first step toward addressing these challenges, and an essential part of the data mining process for revealing natural structures and identifying interesting patterns in the underlying data. Clustering has proven useful for revealing the natural structure inherent in gene expression data; understanding gene functions, cellular processes, and subtypes of cells; mining useful information from noisy data; and understanding gene regulation. Another benefit of clustering gene expression data is the identification of homology, which is very important in vaccine design. This review examines the clustering algorithms applicable to gene expression data in order to identify the clustering techniques that will ensure stability and a high degree of accuracy in the analysis procedure.

    Disease diagnosis in smart healthcare: Innovation, technologies and applications

    To promote sustainable development, the smart city implies a global vision that merges artificial intelligence, big data, decision making, information and communication technology (ICT), and the internet-of-things (IoT). Population ageing is an issue to which researchers, companies and governments should devote effort by developing innovative smart healthcare technologies and applications. In this paper, the topic of disease diagnosis in smart healthcare is reviewed. Typical emerging optimization algorithms and machine learning algorithms are summarized. Evolutionary optimization, stochastic optimization and combinatorial optimization are covered. Because healthcare has many applications, four disease diagnosis applications are considered, all of which were among the top 10 causes of global death in 2015: cardiovascular diseases, diabetes mellitus, Alzheimer’s disease and other forms of dementia, and tuberculosis. In addition, challenges in the deployment of disease diagnosis in healthcare are discussed.

    A review of clustering techniques and developments

    © 2017 Elsevier B.V. This paper presents a comprehensive study of clustering: existing methods and the developments made at various times. Clustering is defined as unsupervised learning in which objects are grouped on the basis of some similarity inherent among them. There are different methods for clustering objects, such as hierarchical, partitional, grid-based, density-based and model-based. The approaches used in these methods are discussed along with their respective state of the art and applicability. The measures of similarity and the evaluation criteria, which are the central components of clustering, are also presented. Applications of clustering in fields such as image segmentation, object and character recognition, and data mining are highlighted.

    BigFCM: Fast, Precise and Scalable FCM on Hadoop

    Clustering plays an important role in mining big data, both as a modeling technique and as a preprocessing step in many data mining process implementations. Fuzzy clustering provides more flexibility than non-fuzzy methods by allowing each data record to belong to more than one cluster to some degree. However, a serious challenge in fuzzy clustering is the lack of scalability. Massive datasets in emerging fields such as geosciences, biology and networking require high-performance parallel and distributed computation to solve real-world problems. Although some clustering methods have already been adapted to run on big data platforms, their execution time increases sharply for large datasets. In this paper, a scalable Fuzzy C-Means (FCM) clustering method named BigFCM is proposed and designed for the Hadoop distributed data platform. Based on the map-reduce programming model, it exploits several mechanisms, including an efficient caching design, to reduce execution time by several orders of magnitude. Extensive evaluation over multi-gigabyte datasets shows that BigFCM is scalable while preserving the quality of clustering.
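
To make "belonging to more than one cluster to some degree" concrete, the following is a minimal single-machine sketch of the textbook Fuzzy C-Means updates. It is not BigFCM and involves no Hadoop or map-reduce; the function names, the fuzzifier default `m=2.0`, and the fixed iteration count are assumptions made for illustration.

```python
import math

def memberships(points, centers, m=2.0):
    """Return u[j][i], the degree to which point j belongs to cluster i."""
    exponent = 2.0 / (m - 1.0)
    u = []
    for p in points:
        dists = [math.dist(p, c) for c in centers]
        if any(d == 0.0 for d in dists):
            # A point lying exactly on a center belongs fully to it
            # (shared equally if several centers coincide with the point).
            zeros = [1.0 if d == 0.0 else 0.0 for d in dists]
            total = sum(zeros)
            u.append([z / total for z in zeros])
            continue
        # Standard FCM membership: u_i = 1 / sum_k (d_i / d_k)^(2 / (m - 1))
        u.append([1.0 / sum((d_i / d_k) ** exponent for d_k in dists)
                  for d_i in dists])
    return u

def update_centers(points, u, m=2.0):
    """Each center is the mean of all points, weighted by membership^m."""
    dim = len(points[0])
    centers = []
    for i in range(len(u[0])):
        weights = [row[i] ** m for row in u]
        total = sum(weights)
        centers.append(tuple(
            sum(w * p[d] for w, p in zip(weights, points)) / total
            for d in range(dim)))
    return centers

def fuzzy_c_means(points, centers, m=2.0, iterations=50):
    """Alternate membership and center updates for a fixed number of rounds."""
    for _ in range(iterations):
        u = memberships(points, centers, m)
        centers = update_centers(points, u, m)
    return centers, memberships(points, centers, m)

if __name__ == "__main__":
    data = [(0.0,), (1.0,), (9.0,), (10.0,)]
    final_centers, u = fuzzy_c_means(data, [(0.0,), (10.0,)])
    for point, row in zip(data, u):
        print(point, [round(x, 3) for x in row])
```

Every row of memberships sums to 1, so a record near one center still keeps a small, non-zero degree of membership in the other cluster.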

    Aspect Mining Using Multiobjective Genetic Clustering Algorithms

    In legacy software, non-functional concerns tend to cut across the system and manifest themselves as tangled or scattered code. If these crosscutting concerns could be modularized and the system refactored, the system would become easier to understand, modify, and maintain. Modularized crosscutting concerns are known as aspects, and the process of identifying aspect candidates in legacy software is called aspect mining. One of the techniques used in aspect mining is clustering, and there are many clustering algorithms. Current aspect mining clustering algorithms attempt to form clusters by optimizing one objective function. However, the objective function to be optimized tends to bias the formation of clusters towards the data model implicitly defined by that function. One solution is to use algorithms that try to optimize more than one objective function. These multiobjective algorithms have been used successfully in data mining but, as far as this author knows, have not been applied to aspect mining. This study investigated the feasibility of using multiobjective evolutionary algorithms, in particular multiobjective genetic algorithms, in aspect mining. The study utilized an existing multiobjective genetic algorithm, MOCK, which had already been tested against several popular single-objective clustering algorithms. MOCK has been shown to be, on average, as good as, and sometimes better than, those algorithms. Since some of those data mining algorithms have counterparts in aspect mining, it was reasonable to assume that MOCK would perform at least as well in an aspect mining context. Since MOCK's objective functions were not directly trying to optimize aspect mining metrics, the study also implemented another multiobjective genetic algorithm, AMMOC, based on MOCK but tailored to optimize those metrics. The reasoning was that, since the goal was to determine whether a clustering method optimized these quality metrics, it made sense to optimize these functions directly rather than evaluate them a posteriori. This study determined that these multiobjective algorithms performed at least as well as two popular aspect mining algorithms, k-means and hierarchical agglomerative clustering. As a result, this study has contributed to the theoretical body of knowledge in the field of aspect mining and has provided a practical tool for the field.
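
As an illustration of a clustering algorithm driven by a single objective function, k-means (one of the two baselines named above) minimises the within-cluster sum of squared distances. The sketch below is a generic Lloyd-style k-means over numeric vectors; the abstract does not say how program elements were encoded as vectors, so the data, function names, and fixed initial centers are assumptions made for illustration.

```python
import math

def assign(points, centers):
    """Label each point with the index of its nearest center."""
    return [min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            for p in points]

def sse(points, centers, labels):
    """The single objective k-means minimises: within-cluster squared error."""
    return sum(math.dist(p, centers[l]) ** 2 for p, l in zip(points, labels))

def k_means(points, centers, iterations=20):
    """Alternate assignment and mean-update steps from the given centers."""
    for _ in range(iterations):
        labels = assign(points, centers)
        new_centers = []
        for i, old in enumerate(centers):
            members = [p for p, l in zip(points, labels) if l == i]
            if members:
                new_centers.append(
                    tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(old)  # keep an empty cluster's center as is
        centers = new_centers
    labels = assign(points, centers)
    return centers, labels, sse(points, centers, labels)

if __name__ == "__main__":
    data = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
    print(k_means(data, [(0.0, 0.0), (10.0, 10.0)]))
```

Because only this one quantity is minimised, the result favours compact, roughly spherical groups, which is the kind of bias toward an implicit data model that the abstract describes.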

    Multi-objective evolutionary algorithms for data clustering

    In this work we investigate the use of Multi-Objective metaheuristics for the data-mining task of clustering. We first investigate methods of evaluating the quality of clustering solutions; we then propose a new Multi-Objective clustering algorithm driven by multiple measures of cluster quality; and finally we investigate the performance of different Multi-Objective clustering algorithms. In the context of clustering, a robust measure for evaluating clustering solutions is an important component of an algorithm. These Cluster Quality Measures (CQMs) should rely solely on the structure of the clustering solution. A robust CQM should have three properties: it should be able to reward a "good" clustering solution; it should decrease in value monotonically as the solution quality deteriorates; and it should be able to evaluate clustering solutions with varying numbers of clusters. We review existing CQMs and present an experimental evaluation of their robustness. We find that measures based on connectivity are more robust than other measures for cluster evaluation. We then introduce a new Multi-Objective Clustering algorithm (MOCA). The use of Multi-Objective optimisation in clustering is desirable because it permits the incorporation of multiple measures of cluster quality. Since the definition of what constitutes a good clustering is far from clear, it is beneficial to develop algorithms that allow multiple CQMs to be accommodated. The selection of the clustering quality measures to use as objectives for MOCA is informed by our previous work with internal evaluation measures. We explain the implementation details and perform experimental work to establish its worth. We compare MOCA with k-means and find some promising results. We find that MOCA can generate a pool of clustering solutions that is more likely to contain the optimal clustering solution than the pool of solutions generated by k-means. We also investigate the performance of different implementations of MOEA algorithms for clustering. We find that representations of clustering based around centroids and medoids produce more desirable clustering solutions and Pareto fronts. We also find that mutation operators that greatly disrupt the clustering solutions lead to better exploration of the Pareto front, whereas mutation operators that modify the clustering solutions in a more moderate way lead to higher quality clustering solutions. We then perform more specific investigations into the performance of mutation operators, focussing on operators that promote clustering solution quality, operators that promote exploration of the Pareto front, and a hybrid combination. We use a number of techniques to assess the performance of the mutation operators as the algorithms execute. We confirm that a disruptive mutation operator leads to better exploration of the Pareto front and that mutation operators that modify the clustering solutions more moderately lead to the discovery of higher quality clustering solutions. We find that our implementation of a hybrid mutation operator does not yield a marked improvement over the other mutation operators but does show promise for future work.
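
The Pareto front referred to above is the set of candidate solutions that no other candidate beats on every objective at once. The following minimal sketch filters a pool of clustering solutions down to that set. It assumes each solution has already been scored on two quality measures that are both to be minimised; the score values are made up, and neither MOCA's actual objectives nor its evolutionary search are reproduced here.

```python
def dominates(a, b):
    """True if a is no worse than b on every objective and strictly
    better on at least one (all objectives are minimised)."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))

def pareto_front(scores):
    """Keep only the score vectors that no other vector dominates."""
    return [s for s in scores if not any(dominates(t, s) for t in scores)]

if __name__ == "__main__":
    # Hypothetical (measure_1, measure_2) scores for five clusterings.
    pool = [(1, 5), (2, 2), (5, 1), (3, 3), (4, 4)]
    print(pareto_front(pool))
```

Here `(3, 3)` and `(4, 4)` are dropped because `(2, 2)` is better on both measures, while the three remaining solutions each represent a different trade-off between the two.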

    Facing-up Challenges of Multiobjective Clustering Based on Evolutionary Algorithms: Representations, Scalability and Retrieval Solutions

    This thesis focuses on multiobjective clustering algorithms, which optimize several objectives simultaneously and return a collection of potential solutions with different trade-offs among objectives. The goal of the thesis is to design and implement a new multiobjective clustering technique based on evolutionary algorithms to address three current challenges related to these techniques. The first challenge is to define properly the area of possible solutions that is explored in order to find the best solution, which depends on the knowledge representation. The second challenge is to scale up the system by splitting the original data set into several subsets so that the clustering process works with less data. The third challenge is to retrieve the most suitable solution, according to the quality and shape of the clusters, from the most interesting region of the collection of solutions returned by the algorithm.

    Development of a R package to facilitate the learning of clustering techniques

    This project explores the development of a tool, in the form of an R package, to ease the process of learning clustering techniques: how they work and what their pros and cons are. The tool should provide implementations of several different clustering techniques, with explanations, so that students can become familiar with the characteristics of each algorithm by testing it against several different datasets while deepening their understanding through the explanations. Additionally, these explanations should adapt to the input data, making the tool suitable not only for self-regulated learning but for teaching too. Grado en Ingeniería Informátic