
    K-means algorithms for functional data

    Cluster analysis of functional data considers that the objects on which a taxonomy is to be performed are functions f : X ⊂ ℝ^p → ℝ, and that the available information about each object is a sample at a finite set of points, f = {(x_i, y_i) ∈ X × ℝ}_{i=1}^n. The aim is to infer the meaningful groups by working explicitly with the data's infinite-dimensional nature. In this paper the use of K-means algorithms to solve this problem is analysed. A comparative study of three K-means algorithms has been conducted: the K-means algorithm for raw data, a kernel K-means algorithm for raw data, and a K-means algorithm using two distances for functional data. These distances, called dVn and dϕ, are based on projections onto Reproducing Kernel Hilbert Spaces (RKHS) and on Tikhonov regularization theory. Although it is shown that both distances are equivalent, they lead to two different strategies to reduce the dimensionality of the data: for the dVn distance the most suitable strategy is Johnson–Lindenstrauss random projections, while the dimensionality reduction for dϕ is based on spectral methods.
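    The random-projection strategy mentioned for the dVn distance can be sketched as follows. This is an illustrative outline, not the paper's implementation: the toy curves, the target dimension d, and the farthest-point initialisation are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy functional data: each row is a curve observed at p grid points.
p = 200
t = np.linspace(0, 1, p)
group_a = np.sin(2 * np.pi * t) + 0.1 * rng.standard_normal((30, p))
group_b = np.cos(2 * np.pi * t) + 0.1 * rng.standard_normal((30, p))
X = np.vstack([group_a, group_b])

# Johnson–Lindenstrauss: a Gaussian random projection to d << p nearly
# preserves pairwise distances, so K-means in the projected space is cheap.
d = 20
R = rng.standard_normal((p, d)) / np.sqrt(d)
Z = X @ R

def lloyd_two_means(Z, iters=20):
    """Plain Lloyd iterations for k = 2 with farthest-point initialisation."""
    c0 = Z[0]
    c1 = Z[np.argmax(((Z - c0) ** 2).sum(axis=1))]
    centers = np.stack([c0, c1])
    for _ in range(iters):
        labels = np.argmin(((Z[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        centers = np.stack([Z[labels == k].mean(axis=0) for k in range(2)])
    return labels

labels = lloyd_two_means(Z)
```

    Because the projection roughly preserves the geometry of the discretized curves, the two groups of curves are recovered while each distance computation costs O(d) instead of O(p).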

    Simultaneous and Single Gene Expression: Computational Analysis for Malaria Treatment Discovery

    The major aim of this work is to develop an efficient and effective k-means algorithm to cluster malaria microarray data, enabling the extraction of functional relationships among genes for malaria treatment discovery. However, traditional k-means and most k-means variants are still computationally expensive for large datasets such as microarray data, which are both numerous and of large dimension d, and biologists face the challenge of extracting useful information from these volumes of data. Firstly, we develop a novel k-means algorithm that is simple but more efficient than both the traditional k-means and a recent enhanced k-means. The new algorithm saves significant computation time at each iteration and thus achieves an O(nk²) expected run time. It is based on the recently established relationship between principal component analysis and k-means clustering, and we prove its correctness theoretically. Results obtained from testing the algorithm on three biological and three non-biological datasets also indicate that it is empirically faster than other known k-means algorithms. We assessed the quality of its clusters against clusters of known structure using the Hubert–Arabie Adjusted Rand index (ARI_HA): when k is close to d the quality is good (ARI_HA > 0.8), and when k is not close to d the quality is excellent (ARI_HA > 0.9). Using three different k-means algorithms, including our novel Metric Matrics k-means (MMk-means), we then compare results from in-vitro microarray data with the classification from in-vivo microarray data in order to perform a comparative functional classification of P. falciparum genes and further validate the effectiveness of our MMk-means algorithm.
Results from this study indicate that the distributions of the three algorithms' in-vitro clusters compared against the in-vivo clusters are similar, authenticating the MMk-means method and its effectiveness. Lastly, using clustering, R (with the Wilcoxon statistical test on this platform), new microarray data of P. yoelii at the liver stage, and P. falciparum microarray data at the blood stages, we extracted twenty-nine (29) viable P. falciparum and P. yoelii genes that can be used to design a Polymerase Chain Reaction (PCR) primer experiment for the detection of malaria at the liver stage. Due to intellectual property rights, we are unable to list these genes here.
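    The PCA/k-means relationship this abstract builds on (established by Ding and He, 2004: the top k-1 principal components span the continuous relaxation of the k-means cluster-indicator subspace) can be sketched as follows. This is a generic illustration, not the MMk-means algorithm itself; the toy expression-like matrix is an assumption.

```python
import numpy as np

rng = np.random.default_rng(1)
# Toy expression-like matrix: n genes observed under d conditions,
# with two planted groups of genes.
n, d, k = 100, 12, 2
X = rng.standard_normal((n, d))
X[: n // 2] += 3.0          # shift half the rows: two clusters

# Top (k-1) principal components via SVD of the centered matrix.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
Z = Xc @ Vt[: k - 1].T      # n x (k-1) reduced representation

# For k = 2 the reduced space is one-dimensional and the two planted
# groups separate by sign along the first principal component.
labels = (Z[:, 0] > 0).astype(int)
```

    Clustering in the (k-1)-dimensional space instead of the original d dimensions is what allows each iteration to depend on k rather than d, which is the source of the O(nk²) expected run time claimed above.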

    IMPROVEMENT OF DATA ANALYSIS BASED ON K-MEANS ALGORITHM AND AKMCA

    Data analysis is improved using the k-means algorithm and AKMCA. Data mining aims to extract information from a large data set and transform it into a usable structure. Exploratory data analysis and data mining applications rely heavily on clustering: grouping a set of objects so that those in the same group (called a cluster) are more similar to each other than to those in other groups (clusters). There are various types of cluster models, such as connectivity models, distribution models, centroid models, and density models. The algorithm studied here makes use of the density-number concept: the set of high-density-number points is extracted from the original data set as a new training set, and points from this set are chosen as the initial cluster centres. K-means, a partition-based clustering algorithm, is the basic clustering technique and the most widely used; it simply divides the dataset into a specified number of clusters and is applied in many fields due to its efficiency and simplicity. However, it is well known that K-means can produce suboptimal results depending on the initial cluster centres chosen, and numerous efforts have been made to improve its performance. The advanced k-means clustering algorithm (AKMCA) is used in data analysis to obtain useful knowledge for various optimisation and classification problems and to process massive amounts of raw and unstructured data. Knowledge discovery provides the tools needed to automate the entire data analysis and error-reduction process, and the efficacy of these methods is investigated through experimental analysis of various datasets.
We present a detailed experimental analysis and a comparison of the proposed work with existing k-means clustering algorithms, and provide a clear and comprehensive account of the k-means algorithm and its various research directions.
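    The density-number initialisation described above can be sketched roughly as follows. This is a loose illustration, not the AKMCA specification: the density radius eps, the separation rule min_sep, and the toy data are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
# Two well-separated blobs of points.
X = np.vstack([rng.normal(0.0, 0.5, (60, 2)),
               rng.normal(5.0, 0.5, (60, 2))])

D = np.sqrt(((X[:, None] - X[None]) ** 2).sum(-1))  # pairwise distances
eps = 1.0
density = (D < eps).sum(axis=1)          # "density number" of each point

# Greedily pick k initial centres: take the highest-density points,
# skipping any candidate too close to an already chosen centre.
k, min_sep = 2, 3.0
chosen = []
for i in np.argsort(-density):
    if all(D[i, j] >= min_sep for j in chosen):
        chosen.append(i)
    if len(chosen) == k:
        break
init_centers = X[chosen]
```

    Seeding K-means with one high-density point per region, instead of arbitrary points, is the kind of initialisation that avoids the suboptimal results mentioned above.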

    Clustering in high dimension for multivariate and functional data using extreme kurtosis projections

    Cluster analysis is the problem of analysing the existence of clusters in a multivariate sample. This analysis is performed by algorithms that differ significantly in their notion of what constitutes a cluster and in how to find clusters efficiently. In this thesis we are interested in large data problems, and we therefore consider algorithms that use dimension-reduction techniques to identify interesting structures in large data sets, particularly algorithms that use the kurtosis coefficient to detect the clusters present in the data. The thesis extends the work of Peña and Prieto (2001a), who identify clusters in multivariate data using univariate projections of the sample onto the directions that minimize and maximize the kurtosis coefficient of the projected data, and of Peña et al. (2010), who used the eigenvalues of a kurtosis matrix to reduce the dimension. The thesis has two main contributions. First, we prove that the extreme kurtosis projections have some optimality properties for mixtures of normal distributions, and we propose an algorithm to identify clusters when the data dimension and the number of clusters present in the sample are high. The good performance of the algorithm is shown through a simulation study in which it is compared with the MCLUST, K-means and CLARA methods. Second, we propose an extension of multivariate kurtosis to functional data and analyze some of its properties for clustering; additionally, we propose an algorithm based on kurtosis projections for functional data, whose good properties are compared with the results obtained by Functional Principal Components, Functional K-means and the FunClust method. The thesis is structured as follows. Chapter 1 is an introductory chapter reviewing theoretical concepts used throughout the thesis. In Chapter 2 we review the concept of kurtosis in detail, study its properties, and give a detailed description of some algorithms proposed in the literature that use the kurtosis coefficient to detect the clusters present in the data. In Chapter 3 we study the directions that may be interesting for detecting several clusters in the sample, analyze how the extreme kurtosis directions are related to these directions, and present a clustering algorithm for high-dimensional data using extreme kurtosis directions. In Chapter 4 we introduce an extension of multivariate kurtosis for functional data, analyze the properties of this measure regarding the identification of clusters, and present a clustering algorithm for functional data using extreme kurtosis directions. We finish with some remarks and conclusions in the final chapter.
    Programa Oficial de Doctorado en Ingeniería Matemática. Committee: Ana María Justel Eusebio (President); Andrés Modesto Alonso Fernández (Secretary); José Manuel Mira Mcwilliam (Member).
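    A rough numerical sketch of the kurtosis-matrix idea of Peña et al. (2010) that the thesis extends: for a balanced two-component mixture, the projection onto the separating direction is bimodal and therefore has minimal kurtosis, so the extreme eigenvectors of a kurtosis matrix of the whitened data point at the clusters. The toy mixture and the whitening step below are assumptions, not the thesis's algorithm.

```python
import numpy as np

rng = np.random.default_rng(3)
# Balanced two-component Gaussian mixture in 5 dimensions,
# separated along the first coordinate.
n, d = 200, 5
X = rng.standard_normal((n, d))
X[: n // 2, 0] += 10.0

# Standardise the sample (whitening via a Cholesky factor of S^-1).
Xc = X - X.mean(axis=0)
S = np.cov(Xc, rowvar=False)
L = np.linalg.cholesky(np.linalg.inv(S))
Z = Xc @ L                               # cov(Z) = I

# Kurtosis matrix M = (1/n) * sum_i ||z_i||^2 z_i z_i^T; its extreme
# eigenvectors approximate the directions minimising / maximising the
# kurtosis of the projected data.
M = (Z * (Z ** 2).sum(axis=1, keepdims=True)).T @ Z / n
vals, vecs = np.linalg.eigh(M)           # eigenvalues in ascending order
v_min = vecs[:, 0]                       # direction of minimal kurtosis

proj = Z @ v_min                         # 1-D projection revealing the clusters
```

    Searching over a single matrix eigendecomposition, instead of optimising the kurtosis over all projection directions, is what makes this approach usable in high dimension.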

    Evaluation of clustering algorithms for gene expression data

    BACKGROUND: Cluster analysis is an integral part of high-dimensional data analysis. In the context of large-scale gene expression data, a filtered set of genes is grouped according to expression profiles using one of the numerous clustering algorithms that exist in the statistics and machine learning literature. A closely related problem is that of selecting a clustering algorithm that is "optimal" in some sense from the rather impressive list of clustering algorithms that currently exist. RESULTS: In this paper, we propose two validation measures, each with two parts: one measuring the statistical consistency (stability) of the clusters produced, and the other representing their biological functional congruence. Smaller values of these indices indicate better performance for a clustering algorithm. We illustrate this approach using two case studies with publicly available gene expression data sets: one involving SAGE data from breast cancer patients and the other a time-course cDNA microarray data set on yeast. Six well-known clustering algorithms (UPGMA, K-Means, Diana, Fanny, Model-Based and SOM) were evaluated. CONCLUSION: No single clustering algorithm may be best suited for clustering genes into functional groups via expression profiles for all data sets. The validation measures introduced in this paper can aid in the selection of an optimal algorithm, for a given data set, from a collection of available clustering algorithms.
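    Validation measures of this kind ultimately compare two partitions of the same objects. A self-contained implementation of the standard Hubert–Arabie adjusted Rand index, one common ingredient of stability assessment (though not necessarily the exact measure proposed in this paper), might look like:

```python
import numpy as np

def adjusted_rand_index(a, b):
    """Hubert–Arabie adjusted Rand index between two label vectors:
    1 for identical partitions (up to relabeling), about 0 for random ones."""
    a, b = np.asarray(a), np.asarray(b)
    n = len(a)
    # Contingency table of the two partitions.
    C = np.array([[np.sum((a == u) & (b == v)) for v in np.unique(b)]
                  for u in np.unique(a)])
    comb2 = lambda x: x * (x - 1) / 2.0      # "n choose 2", elementwise
    sum_cells = comb2(C).sum()
    sum_rows = comb2(C.sum(axis=1)).sum()
    sum_cols = comb2(C.sum(axis=0)).sum()
    expected = sum_rows * sum_cols / comb2(n)
    max_index = (sum_rows + sum_cols) / 2.0
    return (sum_cells - expected) / (max_index - expected)

# Identical up to relabeling -> 1; a disagreeing partition scores lower.
same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
worse = adjusted_rand_index([0, 0, 1, 1], [0, 1, 1, 1])
```

    Stability can then be assessed by re-clustering resampled versions of the data and averaging the index across replicates; unstable algorithms produce low average agreement.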

    Kernel-estimated Nonparametric Overlap-Based Syncytial Clustering

    Standard clustering algorithms usually find regular-structured clusters such as ellipsoidally- or spherically-dispersed groups, but are more challenged with groups lacking formal structure or definition. Syncytial clustering is the name that we introduce for methods that merge groups obtained from standard clustering algorithms in order to reveal complex group structure in the data. Here, we develop a distribution-free fully-automated syncytial clustering algorithm that can be used with k-means and other algorithms. Our approach computes the cumulative distribution function of the normed residuals from an appropriately fit k-groups model and calculates the nonparametric overlap between each pair of groups. Groups with high pairwise overlaps are merged as long as the generalized overlap decreases. Our methodology is always a top performer in identifying groups with regular and irregular structures in several datasets. The approach is also used to identify the distinct kinds of gamma ray bursts in the Burst and Transient Source Experiment 4Br catalog and also the distinct kinds of activation in a functional Magnetic Resonance Imaging study
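    The merge step of syncytial clustering can be sketched in miniature as follows. Note the hedge: the paper's kernel-estimated nonparametric overlap of normed residuals is replaced here by a much cruder proxy (a nearest-centroid confusion rate), and the toy groups are assumptions; only the merge-the-most-overlapping-pair logic matches the description above.

```python
import numpy as np

rng = np.random.default_rng(4)
# Three initial k-means-style groups; the first two actually overlap
# and should be merged, the third stands alone.
A = rng.normal([0.0, 0.0], 0.4, (80, 2))
B = rng.normal([1.0, 0.0], 0.4, (80, 2))   # overlaps A
C = rng.normal([6.0, 0.0], 0.4, (80, 2))   # far away
X = np.vstack([A, B, C])
labels = np.repeat([0, 1, 2], 80)

def pair_overlap(X, labels, i, j):
    """Crude proxy for pairwise group overlap: the fraction of points in
    groups i and j that lie closer to the *other* group's centroid."""
    ci = X[labels == i].mean(axis=0)
    cj = X[labels == j].mean(axis=0)
    mask = (labels == i) | (labels == j)
    pts, own = X[mask], (labels[mask] == j).astype(int)
    dist = np.stack([((pts - ci) ** 2).sum(1),
                     ((pts - cj) ** 2).sum(1)], axis=1)
    return float(np.mean(dist.argmin(axis=1) != own))

overlaps = {(i, j): pair_overlap(X, labels, i, j)
            for i in range(3) for j in range(i + 1, 3)}
to_merge = max(overlaps, key=overlaps.get)   # pair with the highest overlap
```

    In the full method this merging is repeated as long as the generalized overlap keeps decreasing, which is what lets regular k-means groups assemble into irregular clusters.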

    Which fMRI clustering gives good brain parcellations?

    Analysis and interpretation of neuroimaging data often require one to divide the brain into a number of regions, or parcels, with homogeneous characteristics, whether these regions are defined in the brain volume or on the cortical surface. While predefined brain atlases do not adapt to the signal in individual subjects' images, parcellation approaches use brain activity (e.g. found in some functional contrasts of interest) and clustering techniques to define regions with some degree of signal homogeneity. In this work, we address the question of which clustering technique is appropriate and how to optimize the corresponding model. We use two principled criteria: goodness of fit (accuracy) and reproducibility of the parcellation across bootstrap samples. We study these criteria on simulated data and on two task-based functional Magnetic Resonance Imaging datasets, for the Ward, spectral and K-means clustering algorithms. We show that, in general, Ward's clustering performs better than the alternative methods with regard to reproducibility and accuracy, and that the two criteria diverge regarding the preferred models (reproducibility leading to more conservative solutions), thus deferring the practical decision to a higher-level alternative, namely the choice of a trade-off between accuracy and stability.

    Development of a Python package for Functional Data Analysis. Depth measures, applications and clustering

    In this paper, the problem of analyzing functional data is addressed. Each observation in functional data is a function that varies over a continuum. This kind of complex data is becoming increasingly common in many research fields. However, Functional Data Analysis (FDA) is a relatively recent field in which software implementations are basically limited to R; in addition, although they may follow an open-source scheme, contributing to them may turn out to be complicated. The final goal of this project is to provide a comprehensive Python package for Functional Data Analysis, scikit-fda. In this undergraduate thesis, the functionality implemented in the package includes functional depth measures, together with their applications, and elementary notions of clustering. In a functional space, establishing an order is complicated because of its nature; depth measures allow one to define robust statistics for functional data. In the package you can find some of the most common: the Fraiman and Muniz depth measure, the band depth measure, and a modification of the latter, the modified band depth. Depth measures are used in the construction of graphical tools; both the functional boxplot and the magnitude-shape plot are included in the package along with their outlier-detection procedures. Furthermore, contributions in the area of machine learning are made: basic clustering algorithms, K-means and Fuzzy K-means, are added to the package. Finally, the results of applying these methods to the Canadian Weather dataset are shown. The Python package is published in a GitHub repository. It is open-source, with the aim of growing and being kept up to date. In the long term it is expected to cover the fundamental techniques in FDA and become a widely used toolbox for research in FDA.
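    For instance, the modified band depth (with bands defined by pairs of curves, following López-Pintado and Romo's definition) can be sketched as follows on discretized curves. This is an illustrative reimplementation, not the scikit-fda code itself, and the example curves are assumptions.

```python
import numpy as np

def modified_band_depth(X):
    """Modified band depth with bands defined by pairs of curves:
    for each curve, the average, over all pairs, of the fraction of the
    grid on which the curve lies inside the pair's pointwise band."""
    n, _ = X.shape
    depths = np.zeros(n)
    for m in range(n):
        inside = 0.0
        for i in range(n):
            for j in range(i + 1, n):
                lo = np.minimum(X[i], X[j])   # lower envelope of the band
                hi = np.maximum(X[i], X[j])   # upper envelope of the band
                inside += np.mean((lo <= X[m]) & (X[m] <= hi))
        depths[m] = inside / (n * (n - 1) / 2.0)
    return depths

# A central curve is deeper than an outlying one.
t = np.linspace(0, 1, 50)
curves = np.stack([np.zeros(50),
                   0.5 * np.sin(2 * np.pi * t),
                   -0.5 * np.sin(2 * np.pi * t),
                   np.full(50, 5.0)])        # clear outlier
depths = modified_band_depth(curves)
```

    Ordering curves by such a depth is exactly what underlies the functional boxplot and the outlier-detection procedures mentioned above: the deepest curves form the central region, and shallow curves are flagged.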