141,237 research outputs found

    STRETCHED RELIABLE METAHEURISTICS FOR ENHANCING K-MEANS

    Cluster analysis is one of the primary data analysis methods, and k-means is one of the best-known clustering algorithms. It is frequently used in data mining because it performs well on massive data sets. The final result of k-means depends strongly on the choice of initial centroids, which are selected randomly, and the original algorithm converges to a local minimum rather than the global optimum. Many improvements to k-means have been proposed, but most of them require additional inputs, such as threshold values for the number of data points in a set. This paper proposes a new method for finding better initial centroids and an efficient way of assigning data points to suitable clusters with reduced time complexity. According to our experimental results, the proposed algorithm is more accurate and takes less computation time than the original k-means clustering algorithm.
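
The sensitivity to initial centroids described in this abstract can be illustrated with a minimal sketch of the standard (Lloyd's) k-means iteration. This is not the method proposed in the paper, which the abstract does not specify. The function name `lloyd_kmeans`, the helper `make_demo_data`, and the synthetic data are assumptions made for illustration only.

```python
import numpy as np

def lloyd_kmeans(X, k, seed, n_iter=100):
    """Plain Lloyd's k-means with randomly chosen initial centroids (illustrative)."""
    rng = np.random.default_rng(seed)
    # Random initialisation: pick k distinct data points as starting centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every centroid, shape (n, k).
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                new_centroids[j] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # Final assignment and sum of squared errors for the returned centroids.
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    sse = float(d2[np.arange(len(X)), labels].sum())
    return centroids, labels, sse

def make_demo_data(seed=0):
    """Hypothetical test data: four Gaussian blobs in the plane."""
    rng = np.random.default_rng(seed)
    centers = np.array([[0.0, 0.0], [6.0, 0.0], [0.0, 6.0], [6.0, 6.0]])
    return np.vstack([c + rng.normal(scale=1.0, size=(50, 2)) for c in centers])

if __name__ == "__main__":
    X = make_demo_data()
    for seed in range(5):
        _, _, sse = lloyd_kmeans(X, k=4, seed=seed)
        print(f"seed={seed}  SSE={sse:.2f}")
```

Each run stops at a local minimum of the sum of squared errors, so different seeds may end at different values. A common workaround is to keep the best of several restarts.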

    Incremental Genetic K-means Algorithm and its Application in Gene Expression Data Analysis

    Background: In recent years, clustering algorithms have been effectively applied in molecular biology for gene expression data analysis. With the help of clustering algorithms such as K-means, hierarchical clustering, SOM, etc., genes are partitioned into groups based on the similarity between their expression profiles. In this way, functionally related genes are identified. As the amount of laboratory data in molecular biology grows exponentially each year due to advanced technologies such as microarrays, new efficient and effective clustering methods must be developed to process this growing amount of biological data. Results: In this paper, we propose a new clustering algorithm, the Incremental Genetic K-means Algorithm (IGKA). IGKA is an extension of our previously proposed clustering algorithm, the Fast Genetic K-means Algorithm (FGKA). IGKA outperforms FGKA when the mutation probability is small. The main idea of IGKA is to calculate the objective value, Total Within-Cluster Variation (TWCV), and the cluster centroids incrementally whenever the mutation probability is small. IGKA inherits the salient feature of FGKA of always converging to the global optimum. A C program is freely available at http://database.cs.wayne.edu/proj/FGKA/index.htm. Conclusions: Our experiments indicate that, while IGKA has a convergence pattern similar to that of FGKA, it has better time performance once the mutation probability decreases to some point. Finally, we used IGKA to cluster a yeast dataset and found that it increased the enrichment of genes of similar function within the cluster.
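
The idea of updating TWCV and the centroids incrementally can be sketched as follows. This is a minimal illustration and not the authors' C implementation. It assumes the usual definition of TWCV as the sum of squared distances from each point to its cluster centroid, which can be rewritten in terms of per-cluster sums and counts. The class name `IncrementalTWCV` and the helper `twcv_from_scratch` are hypothetical.

```python
import numpy as np

class IncrementalTWCV:
    """Keeps per-cluster sums and counts so that moving one point is cheap.

    Uses the identity (assuming squared Euclidean distance):
        TWCV = sum_n ||x_n||^2 - sum_k ||S_k||^2 / Z_k
    where S_k is the vector sum and Z_k the size of cluster k.
    """

    def __init__(self, X, labels, k):
        self.X = np.asarray(X, dtype=float)
        self.labels = np.asarray(labels).copy()
        self.sq_norm_total = float((self.X ** 2).sum())
        self.sums = np.zeros((k, self.X.shape[1]))
        self.counts = np.zeros(k, dtype=int)
        for x, j in zip(self.X, self.labels):
            self.sums[j] += x
            self.counts[j] += 1

    def centroids(self):
        # Empty clusters get a zero centroid here (an arbitrary choice for the sketch).
        return self.sums / np.maximum(self.counts, 1)[:, None]

    def twcv(self):
        nonempty = self.counts > 0
        per_cluster = (self.sums[nonempty] ** 2).sum(axis=1) / self.counts[nonempty]
        return self.sq_norm_total - float(per_cluster.sum())

    def move(self, i, new_label):
        """Reassign point i; only the two affected clusters are updated."""
        old = self.labels[i]
        if old == new_label:
            return
        self.sums[old] -= self.X[i]
        self.counts[old] -= 1
        self.sums[new_label] += self.X[i]
        self.counts[new_label] += 1
        self.labels[i] = new_label

def twcv_from_scratch(X, labels, k):
    """Reference computation that revisits every point."""
    total = 0.0
    for j in range(k):
        members = X[labels == j]
        if len(members) > 0:
            total += float(((members - members.mean(axis=0)) ** 2).sum())
    return total
```

When only a few points change cluster, as happens when the mutation probability is small, each `move` touches two clusters instead of recomputing all centroids, which is the kind of saving the abstract describes.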

    Optimization based clustering and classification algorithms in analysis of microarray gene expression data sets

    Doctor of Philosophy. Bioinformatics and computational biology are relatively new areas that draw on techniques from computer science, informatics, biochemistry, applied mathematics and other fields to solve biological problems. In recent years the development of new molecular genetics technologies, such as DNA microarrays, has made it possible to measure the expression levels of thousands and even tens of thousands of genes simultaneously. Microarray gene expression technology has facilitated the study of genomic structure and the investigation of biological systems. The numerical output of this technology takes the form of microarray gene expression data sets. These data sets contain a very large number of genes and a relatively small number of samples, and their precise analysis requires robust and suitable computer software. Because of this, only a few existing algorithms are applicable to them, so more efficient methods for solving clustering, gene selection and classification problems on gene expression data sets are required, and those methods need to be computationally feasible and less expensive. The aim of this thesis is to develop new algorithms for solving clustering, gene selection and data classification problems on gene expression data sets. Clustering in gene expression data sets is a challenging problem. The increasing use of DNA microarray-based tumour gene expression profiles for cancer diagnosis requires more efficient methods for clustering these profiles. Different algorithms for clustering genes have been proposed; however, few of them can be applied to the clustering of samples. The k-means algorithm is among the very few clustering algorithms applicable to microarray gene expression data sets; however, it is not efficient when the number of genes is in the thousands, and it is very sensitive to the choice of starting point. Additionally, when the number of clusters is relatively large, this algorithm yields local minima that can differ significantly from the global solution. Over the last several years different approaches have been proposed to improve the global search properties of the k-means algorithm. One of them is the global k-means algorithm; however, this algorithm is not efficient when data are sparse. In this thesis we developed a new version of the global k-means algorithm, the modified global k-means algorithm, which is effective for solving clustering problems in gene expression data sets. In a microarray gene expression data set, in many cases only a small fraction of genes are informative, whereas most of them are non-informative and contribute noise. Therefore the development of gene selection algorithms that remove as many non-informative genes as possible is very important. In this thesis we developed a new overlapping gene selection algorithm. This algorithm is based on calculating overlaps of different genes. It considerably reduces the number of genes and is efficient in finding a subset of informative genes. Over the last decade different approaches have been proposed to solve supervised data classification problems in gene expression data sets. In this thesis we developed a new approach, based on so-called max-min separability, and compared it with the other approaches. Max-min separability is equivalent to piecewise linear separability. An incremental algorithm is presented to compute piecewise linear functions separating two sets. This algorithm is applied along with a special gene selection algorithm. In this thesis, all new algorithms have been tested on 10 publicly available gene expression data sets, and our numerical results demonstrate the efficiency of the new algorithms that were developed in the framework of this research.

    Interactive VPL-based global illumination on the GPU using fuzzy clustering

    Physically based synthesis of high-quality imagery, including global illumination light transport phenomena, results in a significant workload, which makes interactive rendering a very challenging task. We propose a VPL-based ray tracing approach that runs entirely on the GPU and achieves interactive frame rates while handling global illumination light transport phenomena. This approach is based on clustering both shading points and VPLs and computing visibility only among the clusters' representatives. A new massively parallel K-means clustering algorithm enables efficient execution on the GPU. Rendering artifacts that could result from the piecewise constant approximation of the VPL/shading point visibility function introduced by the clustering are smoothed away by an innovative approach based on fuzzy clustering and weighted interpolation of the visibility function. The effectiveness of the proposed approach is experimentally verified on a collection of scenes, with frame rates above 3 fps and up to 25 fps being demonstrated.
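
The smoothing step mentioned in this abstract, replacing a piecewise constant per-cluster value with a membership-weighted blend, can be sketched on the CPU as follows. This is only an illustration under stated assumptions: it uses the standard fuzzy c-means membership formula with fuzzifier `m`, and the per-cluster visibility values are made up. The paper's actual GPU algorithm and weighting scheme are not given in the abstract, and the function names are hypothetical.

```python
import numpy as np

def fuzzy_memberships(points, centers, m=2.0, eps=1e-12):
    """Standard fuzzy c-means memberships: u_ij is proportional to d_ij^(-2/(m-1))."""
    # Squared distances from each point to each cluster representative, shape (n, c).
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    d2 = np.maximum(d2, eps)  # avoid division by zero when a point sits on a center
    inv = d2 ** (-1.0 / (m - 1.0))
    return inv / inv.sum(axis=1, keepdims=True)  # each row sums to 1

def interpolate_visibility(points, centers, cluster_visibility, m=2.0):
    """Blend per-cluster visibility values using the fuzzy memberships as weights."""
    u = fuzzy_memberships(points, centers, m=m)
    return u @ np.asarray(cluster_visibility, dtype=float)

if __name__ == "__main__":
    centers = np.array([[0.0, 0.0], [1.0, 0.0]])
    visibility = [1.0, 0.0]  # hypothetical: cluster 0 fully visible, cluster 1 occluded
    points = np.array([[0.0, 0.0], [0.5, 0.0], [1.0, 0.0]])
    print(interpolate_visibility(points, centers, visibility))
```

With hard clustering, a point just left of the midpoint would get visibility 1 and a point just right of it would get 0. The weighted blend varies continuously between the two representatives instead.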

    Clustering analysis using Swarm Intelligence

    This thesis is concerned with the application of swarm intelligence methods to the clustering analysis of datasets. The main objectives of the thesis are:
    ∙ Take advantage of a novel evolutionary algorithm, called artificial bee colony, to improve the capability of K-means in finding globally optimal clusters in nonlinear partitional clustering problems.
    ∙ Treat partitional clustering as an optimization problem and apply an improved ant-based algorithm, named Opposition-Based API (after the name of Pachycondyla APIcalis ants), to the automatic grouping of large unlabeled datasets.
    ∙ Define partitional clustering as a multiobjective optimization problem. The aim is to obtain well-separated, connected, and compact clusters, and for this purpose two objective functions have been defined based on the concepts of data connectivity and cohesion. These functions are the core of an efficient multiobjective particle swarm optimization algorithm, which has been devised for and applied to the automatic grouping of large unlabeled datasets.
    For that purpose, this thesis is divided into five main parts:
    ∙ The first part, Chapter 1, introduces the state of the art of swarm intelligence based clustering methods.
    ∙ The second part, Chapter 2, covers clustering analysis with a combination of the artificial bee colony algorithm and the K-means technique.
    ∙ The third part, Chapter 3, presents clustering analysis using the opposition-based API algorithm.
    ∙ The fourth part, Chapter 4, covers multiobjective clustering analysis using particle swarm optimization.
    ∙ Finally, the fifth part, Chapter 5, concludes the thesis and addresses the future directions and open issues of this research.

    On the Development of Machine Learning Based Real-Time Stress Monitoring : A Pilot Study

    During specific environmental changes, the human body regulates itself through emotional, physical or mental responses. One such response is stress. The psychological and physical stability of an individual may be affected by recurrent occurrences of acute stress. This often leads to anxiety disorders, other psychological illnesses, hypertension, and other physiological disorders. Long-term stress also negatively affects the individual's work performance. Across various age groups, the global population is widely affected by anxiety, depression and psychological stress. The long-term adverse effects of stress can be mitigated by effectively monitoring and managing stress through a cost-efficient and reliable stress detection system. This paper mainly focuses on stress detection using a machine learning approach. Wearable sensor data from electroencephalogram (EEG) and electrocardiogram (ECG) recordings taken during exposure to stress are considered, and the level of stress experienced by the participant is further analyzed. This approach helps in stress detection, analysis and mitigation, which in turn improves people's quality of life. After removal of artifacts, the k-means clustering algorithm, a machine learning technique, is used to obtain case-specific clusters that segregate features pointing to non-stress and stress periods. The results of the proposed K-means clustering algorithm are compared with state-of-the-art techniques such as Artificial Neural Network (ANN), Decision Tree (DT), Random Forest (RF) and Support Vector Machine (SVM). From the results, it was concluded that the proposed algorithm outperformed the others, with an accuracy of 96% in the overall analysis.

    Federated K-Means Clustering via Dual Decomposition-based Distributed Optimization

    The use of distributed optimization in machine learning can be motivated either by the resulting preservation of privacy or by the increase in computational efficiency. On the one hand, training data might be stored across multiple devices. Training a global model within a network where each node only has access to its own confidential data requires the use of distributed algorithms. Even if the data is not confidential, sharing it might be prohibitive due to bandwidth limitations. On the other hand, the ever-increasing amount of available data leads to large-scale machine learning problems. By splitting the training process across multiple nodes, its efficiency can be significantly increased. This paper aims to demonstrate how dual decomposition can be applied for distributed training of K-means clustering problems. After an overview of distributed and federated machine learning, the mixed-integer quadratically constrained programming-based formulation of the K-means clustering training problem is presented. The training can be performed in a distributed manner by splitting the data across different nodes and linking these nodes through consensus constraints. Finally, the performance of the subgradient method, the bundle trust method, and the quasi-Newton dual ascent algorithm is evaluated on a set of benchmark problems. While the mixed-integer programming-based formulation of the clustering problems suffers from weak integer relaxations, the presented approach can potentially be used to enable an efficient solution in the future, both in a central and in a distributed setting.