
    Clustering of gene expression data: performance and similarity analysis

    BACKGROUND: DNA microarray technology is an innovative methodology in experimental molecular biology which has produced huge amounts of valuable gene expression data. Many clustering algorithms have been proposed to analyze gene expression data, but little guidance is available to help choose among them. The evaluation of feasible and applicable clustering algorithms is becoming an important issue in today's bioinformatics research. RESULTS: In this paper we first experimentally study three major clustering algorithms: Hierarchical Clustering (HC), Self-Organizing Map (SOM), and Self-Organizing Tree Algorithm (SOTA), using yeast Saccharomyces cerevisiae gene expression data, and compare their performance. We then introduce Cluster Diff, a new data mining tool, to conduct similarity analysis of clusters generated by different algorithms. The performance study shows that SOTA is more efficient than SOM, while HC is the least efficient. The results of the similarity analysis show that, given a target cluster, Cluster Diff can efficiently determine the closest match from a set of clusters. It is therefore an effective approach for evaluating different clustering algorithms. CONCLUSION: HC methods allow a visual, convenient representation of genes; however, they are neither robust nor efficient. SOM is more robust against noise, but has the disadvantage that the number of clusters must be fixed beforehand. SOTA combines the advantages of both hierarchical and SOM clustering: it allows a visual representation of the clusters and their structure, is not sensitive to noise, and is more flexible than the other two methods. Using our data mining tool, Cluster Diff, it is possible to analyze the similarity of clusters generated by different algorithms and thereby compare different clustering methods.
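    The abstract does not specify Cluster Diff's similarity measure; as a rough illustrative sketch only, matching a target cluster against candidate clusters by set overlap (Jaccard similarity, an assumption) could look like this, with made-up yeast ORF names and a hypothetical `closest_match` helper:

```python
# Hypothetical sketch; Jaccard similarity is an assumption, not
# necessarily the measure Cluster Diff actually uses.
def jaccard(a, b):
    """Jaccard similarity between two gene sets: |A & B| / |A | B|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def closest_match(target, candidates):
    """Index of the candidate cluster most similar to the target cluster."""
    return max(range(len(candidates)), key=lambda i: jaccard(target, candidates[i]))

# Toy example with made-up yeast ORF names:
target = {"YAL001C", "YAL002W", "YAL003W"}
clusters = [{"YBR001C", "YBR002W"},
            {"YAL001C", "YAL002W", "YDR001C"},
            {"YAL003W"}]
best = closest_match(target, clusters)  # cluster 1 shares two of three genes
```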

    A novel fuzzy clustering approach to regionalise watersheds with an automatic determination of optimal number of clusters

    One of the most important problems faced in hydrology is the estimation of flood magnitudes and frequencies in ungauged basins. Hydrological regionalisation is used to transfer information from gauged to ungauged watersheds. However, to obtain reliable results, the watersheds involved must have similar hydrological behaviour. In this study, two different clustering approaches are used and compared to identify hydrologically homogeneous regions. The Fuzzy C-Means algorithm (FCM), which is widely used in regionalisation studies, requires the calculation of cluster validity indices to determine the optimal number of clusters. The Fuzzy Minimals algorithm (FM) has an advantage over other fuzzy clustering algorithms: it does not need to know the number of clusters a priori, so cluster validity indices are not used. A regional homogeneity test based on the L-moments approach is used to check the homogeneity of the regions identified by both cluster analysis approaches. The validation of the FM algorithm in deriving homogeneous regions for flood frequency analysis is illustrated through its application to data from watersheds in Alto Genil (southern Spain). According to the results, the FM algorithm is recommended for identifying hydrologically homogeneous regions for regional frequency analysis.
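    To see why FCM needs validity indices at all: it must be run with a fixed cluster count c. A minimal sketch of the FCM update equations on 1-D data follows (fuzzifier m = 2 and quantile-spread initialization are illustrative choices, not taken from the paper; the FM algorithm itself is not sketched here):

```python
def fcm(xs, c, m=2.0, iters=50):
    """Minimal fuzzy c-means on 1-D data; c must be chosen in advance.
    Returns the cluster centers and the membership matrix."""
    srt = sorted(xs)
    centers = [srt[i * (len(xs) - 1) // (c - 1)] for i in range(c)]  # spread inits
    p = 2.0 / (m - 1.0)
    for _ in range(iters):
        # membership update: u_ik = 1 / sum_j (d_ik / d_jk)^(2/(m-1))
        u = []
        for x in xs:
            d = [abs(x - v) or 1e-12 for v in centers]
            u.append([1.0 / sum((d[i] / d[j]) ** p for j in range(c))
                      for i in range(c)])
        # center update: mean of the data weighted by memberships^m
        centers = [sum(u[k][i] ** m * xs[k] for k in range(len(xs))) /
                   sum(u[k][i] ** m for k in range(len(xs)))
                   for i in range(c)]
    return centers, u

centers, u = fcm([1.0, 1.1, 0.9, 5.0, 5.1, 4.9], c=2)
```

    Because c is fixed up front, FCM is typically wrapped in an outer loop over candidate values of c scored by a validity index; FM, as the abstract notes, avoids that outer loop.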

    A Novel Clustering-Based Algorithm for Continuous and Non-invasive Cuff-Less Blood Pressure Estimation

    Extensive research has been performed on continuous, non-invasive, cuffless blood pressure (BP) measurement using artificial intelligence algorithms. This approach involves extracting features from physiological signals such as ECG, PPG, ICG, and BCG as independent variables, extracting features from arterial blood pressure (ABP) signals as dependent variables, and then using machine learning algorithms to develop a blood pressure estimation model from these data. The greatest challenge in this field is the insufficient accuracy of the estimation models. This paper proposes a novel blood pressure estimation method that adds a clustering step to improve accuracy. The proposed method extracts Pulse Transit Time (PTT), PPG Intensity Ratio (PIR), and Heart Rate (HR) features from electrocardiogram (ECG) and photoplethysmogram (PPG) signals as the inputs of clustering and regression, extracts Systolic Blood Pressure (SBP) and Diastolic Blood Pressure (DBP) features from ABP signals as dependent variables, and then develops regression models by applying Gradient Boosting Regression (GBR), Random Forest Regression (RFR), and Multilayer Perceptron Regression (MLP) on each cluster. The method was implemented on the MIMIC-II dataset, with the silhouette criterion used to determine the optimal number of clusters. The results showed that, because of the inconsistency, high dispersion, and multi-trend behavior of the extracted feature vectors, accuracy can be significantly improved by running a clustering algorithm, developing a regression model on each cluster, and finally taking a weighted average of the results based on the error of each cluster. When implemented with 5 clusters and GBR, this approach yielded an MAE of 2.56 for SBP estimates and 2.23 for DBP estimates, significantly better than the best results without clustering (SBP: 6.36, DBP: 6.27).
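    The cluster-then-regress idea can be sketched with a toy 1-D version: made-up PTT values and SBP targets, fixed cluster centers, and plain least squares standing in for GBR/RFR/MLP (all of these are assumptions for illustration, not the paper's setup):

```python
from statistics import mean

def linfit(xs, ys):
    """Ordinary least squares fit y = a*x + b."""
    mx, my = mean(xs), mean(ys)
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

def nearest(x, centroids):
    """Index of the centroid closest to x."""
    return min(range(len(centroids)), key=lambda j: abs(x - centroids[j]))

def cluster_then_regress(xs, ys, centroids):
    """Fit one regression model per cluster of the feature space."""
    models = {}
    for ci in range(len(centroids)):
        idx = [i for i, x in enumerate(xs) if nearest(x, centroids) == ci]
        models[ci] = linfit([xs[i] for i in idx], [ys[i] for i in idx])
    return models

def predict(x, centroids, models):
    """Route a new sample to its cluster's model."""
    a, b = models[nearest(x, centroids)]
    return a * x + b

# Toy data: PTT (s) vs. SBP (mmHg), two regimes handled by two local models.
ptt = [0.20, 0.22, 0.24, 0.40, 0.42, 0.44]
sbp = [140., 138., 136., 110., 108., 106.]
centroids = [0.22, 0.42]  # assumed cluster centers
models = cluster_then_regress(ptt, sbp, centroids)
```

    Each local model only has to capture one trend, which is the intuition behind the accuracy gain the abstract reports for multi-trend feature vectors.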

    Clustering algorithms subjected to K-means and Gaussian mixture model on multidimensional data set

    This paper explores clustering methods. Two main categories of algorithms are used, namely k-means and Gaussian Mixture Model clustering. We look at algorithms within these categories, the types of problems they solve, and the methods that can be used to determine the number of clusters. Finally, we test the algorithms on sparse multidimensional data on worldwide video game sales, categorizing the sales into three levels of high, medium, and low, and showing that a simple implementation can achieve nontrivial results. The result is presented as an evaluation of whether there is potential for online clustering of video game sales. We also discuss some task-specific improvements and which approach is most suitable.
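    A minimal sketch of the k-means half of the comparison, on hypothetical worldwide sales figures (in millions of copies) grouped into three levels:

```python
def kmeans_1d(xs, k, iters=20):
    """Plain k-means on 1-D data; returns final centers and groups."""
    centers = sorted(xs)[::max(1, len(xs) // k)][:k]  # spread initial centers
    groups = []
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for x in xs:
            groups[min(range(k), key=lambda i: abs(x - centers[i]))].append(x)
        centers = [sum(g) / len(g) if g else centers[i]
                   for i, g in enumerate(groups)]
    return centers, groups

sales = [0.1, 0.2, 0.3, 1.0, 1.2, 1.1, 5.0, 6.0, 5.5]  # hypothetical titles
centers, groups = kmeans_1d(sales, k=3)
# sorted(centers) lands near 0.2 (low), 1.1 (medium), 5.5 (high)
```

    A GMM variant replaces the hard nearest-center assignment with soft Gaussian responsibilities, but the alternating assign/update loop has the same shape.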

    Clustering methods of wind turbines and their application in short-term wind power forecasts

    Commonly used wind power forecasting methods choose only one representative wind turbine to forecast the output power of the entire wind farm; however, this approach may reduce forecasting accuracy. If each wind turbine in a wind farm is forecast individually, the computational cost increases considerably, especially for a large wind farm. In this work, a compromise approach is developed in which the turbines in the wind farm are clustered and a forecast is made for each cluster. Three clustering methods are evaluated: K-means, a self-organizing map (SOM), and spectral clustering (SC). First, wind turbines in a wind farm are clustered into several groups by identifying similar characteristics of wind speed and output power. The Silhouette coefficient and Hopkins statistic indices are adopted to determine the optimal number of clusters, an important parameter in cluster analysis. Next, forecasting models of the representative wind turbines selected for each cluster based on correlation analysis are established separately. A comparative study of forecast performance is carried out to determine the most effective clustering method. Results show that short-term wind power forecasting based on SOM and SC clustering forecasts the output power of the entire wind farm with better accuracy (by 1.67% and 1.43%, respectively) than forecasts using a single wind speed or power series to represent the wind farm. Both the Hopkins statistic and the Silhouette coefficient are effective in choosing the optimal number of clusters. In addition, SOM, with its higher forecast accuracy, and SC, with its more efficient calculation, can provide guidance for the operation and dispatch of wind power. The emphasis of the paper is on the clustering methods and their effect on wind power forecasts, not on the forecasting algorithms themselves.
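    The Silhouette coefficient used above can be computed directly; a minimal 1-D version (assuming every cluster has at least two members) is:

```python
def silhouette(xs, labels):
    """Mean silhouette coefficient: for each point, compare its mean
    intra-cluster distance a with the smallest mean distance b to any
    other cluster; the per-point score is (b - a) / max(a, b)."""
    n, score = len(xs), 0.0
    for i, x in enumerate(xs):
        own = [abs(x - xs[j]) for j in range(n) if labels[j] == labels[i] and j != i]
        a = sum(own) / len(own)
        b = min(sum(abs(x - xs[j]) for j in range(n) if labels[j] == lab) /
                sum(1 for j in range(n) if labels[j] == lab)
                for lab in set(labels) if lab != labels[i])
        score += (b - a) / max(a, b)
    return score / n

# Two tight groups: correct 2-cluster labels score near 1, shuffled labels poorly.
speeds = [0.0, 0.1, 0.2, 10.0, 10.1, 10.2]
good = silhouette(speeds, [0, 0, 0, 1, 1, 1])
bad = silhouette(speeds, [0, 1, 0, 1, 0, 1])
```

    In practice the statistic is computed for each candidate cluster number, and the number with the highest mean silhouette is kept.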

    Model Based Automatic and Robust Spike Sorting for Large Volumes of Multi-channel Extracellular Data

    Spike sorting is a critical step for single-unit-based analysis of neural activity recorded extracellularly and simultaneously with multi-channel electrodes. When dealing with recordings from very large numbers of neurons, existing methods, which are mostly semiautomatic in nature, become inadequate. This dissertation aims at automating the spike sorting process. A high-performance, automatic, and computationally efficient spike detection and clustering system, the M-Sorter2, is presented. The M-Sorter2 employs the modified multiscale correlation of wavelet coefficients (MCWC) for neural spike detection. At the center of the proposed M-Sorter2 are two automatic spike clustering methods. They share a common hierarchical agglomerative modeling (HAM) model search procedure to strategically form a sequence of mixture models, and a new model selection criterion called difference of model evidence (DoME) to automatically determine the number of clusters. The two methods differ in how they perform clustering to infer model parameters: one uses robust variational Bayes (RVB) and the other uses robust Expectation-Maximization (REM) for Student's t-mixture modeling. The M-Sorter2 is thus a significantly improved, fully automatic approach to sorting. M-Sorter2 was evaluated and benchmarked against popular algorithms using simulated, artificial, and real datasets with ground truth that are openly available to researchers. Simulated datasets with known statistical distributions were first used to illustrate how the clustering algorithms, namely REMHAM and RVBHAM, provide robust clustering results under commonly experienced performance-degrading conditions such as random initialization of parameters, high dimensionality of data, low signal-to-noise ratio (SNR), ambiguous clusters, and asymmetry in cluster sizes. For the artificial dataset from single-channel recordings, the proposed sorter outperformed Wave_Clus, Plexon's Offline Sorter, and Klusta in most of the comparison cases. For the real datasets from multi-channel electrodes, tetrodes, and polytrodes, the proposed sorter outperformed all comparison algorithms in terms of false positive and false negative rates. The software package presented in this dissertation is available for open access.
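    M-Sorter2's MCWC detector is wavelet-based; a far simpler baseline that spike-detection work is commonly compared against is amplitude thresholding at a multiple of a robust (MAD-based) noise estimate. The threshold factor k = 5 below is an illustrative choice, not a value from the dissertation:

```python
from statistics import median

def detect_spikes(signal, k=5.0):
    """Flag samples deviating from the median by more than k * sigma,
    with sigma estimated robustly as MAD / 0.6745."""
    med = median(signal)
    sigma = median(abs(v - med) for v in signal) / 0.6745
    return [i for i, v in enumerate(signal) if abs(v - med) > k * sigma]

noise = [0.1, -0.1, 0.05, -0.05] * 5   # synthetic low-amplitude background
noise[7] = 4.0                          # one injected spike
spikes = detect_spikes(noise)           # -> [7]
```

    The MAD-based sigma is preferred over the sample standard deviation here because the spikes themselves would inflate the latter and raise the threshold.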

    A fuzzy anomaly detection system based on hybrid PSO-Kmeans algorithm in content-centric networks

    In Content-Centric Networks (CCNs), a possible future Internet, new kinds of attacks and security challenges – from Denial of Service (DoS) to privacy attacks – will arise. An efficient and effective security mechanism is required to secure content and defend against unknown and new forms of attacks and anomalies. Clustering algorithms usually fit the requirements for building a good anomaly detection system. K-means is a popular anomaly detection method for classifying data into different categories; however, it suffers from local convergence and sensitivity to the selection of the cluster centroids. In this paper, we present a novel fuzzy anomaly detection system that works in two phases. In the first phase – the training phase – we propose a hybridization of Particle Swarm Optimization (PSO) and the K-means algorithm with two simultaneous cost functions, well-separated clusters and local optimization, to determine the optimal number of clusters. Once the optimal placement of cluster centroids and objects is determined, the second phase begins. In this phase – the detection phase – we employ a fuzzy approach combining two distance-based methods, classification and outlier detection, to detect anomalies in new monitoring data. Experimental results demonstrate that the proposed algorithm can achieve the optimal number of clusters and well-separated clusters, and can simultaneously increase the detection rate and decrease the false positive rate compared to some other well-known clustering algorithms.
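    To make the PSO/K-means hybridization concrete, here is a heavily simplified sketch in which a particle swarm searches directly for centroid positions minimizing the K-means cost on 1-D data. The coefficients 0.7/1.5/1.5 and the single SSE cost are illustrative; the paper optimizes two cost functions simultaneously:

```python
import random

def sse(centers, xs):
    """K-means cost: total squared distance to the nearest center."""
    return sum(min((x - c) ** 2 for c in centers) for x in xs)

def pso_centroids(xs, k, n_particles=10, iters=60):
    """Each particle is a candidate set of k centroids; the swarm moves
    toward personal and global bests under the SSE cost."""
    random.seed(1)
    lo, hi = min(xs), max(xs)
    pos = [[random.uniform(lo, hi) for _ in range(k)] for _ in range(n_particles)]
    vel = [[0.0] * k for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    gbest = min(pbest, key=lambda p: sse(p, xs))[:]
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(k):
                r1, r2 = random.random(), random.random()
                vel[i][d] = (0.7 * vel[i][d]
                             + 1.5 * r1 * (pbest[i][d] - pos[i][d])
                             + 1.5 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            if sse(pos[i], xs) < sse(pbest[i], xs):
                pbest[i] = pos[i][:]
                if sse(pos[i], xs) < sse(gbest, xs):
                    gbest = pos[i][:]
    return gbest

data = [1.0, 1.2, 0.8, 9.0, 9.2, 8.8]  # two obvious groups
best = pso_centroids(data, k=2)
```

    Because the swarm explores many centroid configurations at once, it is less prone to the poor local optima that a single K-means run can converge to from a bad initialization.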

    Fuzzy clustering with volume prototypes and adaptive cluster merging

    Two extensions to the objective function-based fuzzy clustering are proposed. First, the (point) prototypes are extended to hypervolumes, whose size can be fixed or can be determined automatically from the data being clustered. It is shown that clustering with hypervolume prototypes can be formulated as the minimization of an objective function. Second, a heuristic cluster merging step is introduced where the similarity among the clusters is assessed during optimization. Starting with an overestimation of the number of clusters in the data, similar clusters are merged in order to obtain a suitable partitioning. An adaptive threshold for merging is proposed. The extensions proposed are applied to Gustafson–Kessel and fuzzy c-means algorithms, and the resulting extended algorithm is given. The properties of the new algorithm are illustrated by various examples
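    The merging step can be illustrated on 1-D cluster centers, with the threshold adapted to the data as a fraction of the mean pairwise center distance. The fraction 0.5 and midpoint merging are assumptions for illustration; the paper derives its own adaptive threshold from cluster similarity assessed during optimization:

```python
def merge_similar(centers, factor=0.5):
    """Merge centers closer than factor * (mean pairwise center distance).
    Assumes at least two centers; merged pairs collapse to midpoints."""
    dists = [abs(a - b) for i, a in enumerate(centers) for b in centers[i + 1:]]
    thr = factor * sum(dists) / len(dists)
    merged = []
    for c in sorted(centers):
        if merged and abs(c - merged[-1]) < thr:
            merged[-1] = (merged[-1] + c) / 2   # merge into the midpoint
        else:
            merged.append(c)
    return merged

# An overestimated partitioning with 5 centers collapses to 3.
merged = merge_similar([1.0, 1.2, 5.0, 9.0, 9.1])
```

    Starting from an overestimate and merging downward, as the abstract describes, avoids having to guess the cluster count exactly in advance.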