K-means algorithms for functional data
Cluster analysis of functional data considers that the objects on which one wants to perform a taxonomy are functions f : X ⊆ ℝ^p → ℝ, and that the available information about each object is a sample at a finite set of points, f = {(x_i, y_i)}_{i=1}^n ⊆ X × ℝ. The aim is to infer the meaningful groups by working explicitly with the functions' infinite-dimensional nature.
In this paper the use of K-means algorithms to solve this problem is analysed. A comparative study of three K-means algorithms has been conducted: the K-means algorithm for raw data, a kernel K-means algorithm for raw data, and a K-means algorithm using two distances for functional data. These distances, called d_Vn and d_φ, are based on projections onto Reproducing Kernel Hilbert Spaces (RKHS) and Tikhonov regularization theory. Although it is shown that both distances are equivalent, they lead to two different strategies to reduce the dimensionality of the data. In the case of the d_Vn distance the most suitable strategy is Johnson–Lindenstrauss random projections. The dimensionality reduction for d_φ is based on spectral methods.
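When each function is observed on a common grid, the baseline the paper compares against (K-means on the raw sampled values) treats every curve as a vector and uses the discretized L2 distance. The sketch below illustrates that baseline with made-up toy curves; it is not the RKHS-based d_Vn or d_φ approach:

```python
import numpy as np

def init_centres(X, k):
    # farthest-point (maximin) seeding: deterministic and spreads the centres
    centres = [X[0]]
    for _ in range(1, k):
        dist = np.min([((X - c) ** 2).sum(axis=1) for c in centres], axis=0)
        centres.append(X[np.argmax(dist)])
    return np.array(centres)

def kmeans(X, k, n_iter=50):
    centres = init_centres(X, k)
    for _ in range(n_iter):
        # nearest-centre assignment; Euclidean distance on the grid values
        # approximates the L2 distance between the underlying functions
        labels = np.argmin(((X[:, None, :] - centres[None]) ** 2).sum(axis=2), axis=1)
        new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new, centres):
            break
        centres = new
    return labels, centres

# toy example: two groups of noisy curves observed on a common grid
grid = np.linspace(0.0, 1.0, 50)
rng = np.random.default_rng(1)
curves = np.vstack([
    np.sin(2 * np.pi * grid) + 0.1 * rng.standard_normal((20, 50)),
    np.cos(2 * np.pi * grid) + 0.1 * rng.standard_normal((20, 50)),
])
labels, _ = kmeans(curves, k=2)
```

With well-separated curve shapes the two groups are recovered exactly; the functional methods in the paper aim at cases where such raw-vector clustering degrades.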
Simultaneous and Single Gene Expression: Computational Analysis for Malaria Treatment Discovery
The major aim of this work is to develop an efficient and effective k-means algorithm to
cluster malaria microarray data to enable the extraction of a functional relationship of
genes for malaria treatment discovery. However, traditional k-means and most k-means
variants are still computationally expensive for large, high-dimensional datasets such as microarray
data, where the dimension d is large. Huge volumes of data are generated, and biologists face the
challenge of extracting useful information from them. Firstly, in this work, we develop a novel
k-means algorithm, which is simple but more efficient than the traditional k-means and the recent
enhanced k-means. Using our method, the new k-means algorithm saves significant computation time at
each iteration and thus arrives at an O(nk²) expected run time. Our new algorithm is based on the
recently established relationship between principal component analysis and the k-means
clustering. We further prove theoretically that our algorithm is correct. Results obtained
from testing the algorithm on three biological and three non-biological datasets also
indicate that our algorithm is empirically faster than other known k-means algorithms. We
assessed the quality of our algorithm's clusters against clusters of known structure using
the Hubert-Arabie Adjusted Rand index (ARI_HA): when k is close to d, the quality is good
(ARI_HA > 0.8), and when k is not close to d, the quality of our new k-means algorithm is
excellent (ARI_HA > 0.9). Using three different k-means algorithms, including our novel
Metric Matrics k-means (MMk-means), we compare results from an in-vitro microarray dataset
with the classification from an in-vivo microarray dataset in order to perform a comparative
functional classification of P. falciparum genes and further validate the effectiveness of
our MMk-means algorithm. Results from this study indicate that the distributions of the
three algorithms' in-vitro clusters compared against the in-vivo clusters are similar,
thereby authenticating our MMk-means method and its effectiveness. Lastly, using clustering,
R programming (with a Wilcoxon statistical test on this platform), the new microarray data
of P. yoelii at the liver stage, and the P. falciparum microarray data at the blood stages,
we extracted twenty-nine (29) viable P. falciparum and P. yoelii genes that can be used for
designing a Polymerase Chain Reaction (PCR) primer experiment for the detection of malaria
at the liver stage. Due to intellectual property rights, we are unable to list these genes here.
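The relationship between principal component analysis and k-means that the algorithm above builds on can be illustrated generically: project the points onto a few principal components, then cluster in the reduced space. The sketch below uses made-up toy data and plain PCA; it is an illustration of that general idea, not the MMk-means algorithm itself:

```python
import numpy as np

def pca_reduce(X, q):
    # centre the data and project onto the top-q principal components
    Xc = X - X.mean(axis=0)
    vals, vecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    top = vecs[:, np.argsort(vals)[::-1][:q]]
    return Xc @ top

# toy data: two well-separated groups in dimension d = 100
rng = np.random.default_rng(0)
d = 100
X = np.vstack([rng.standard_normal((50, d)),
               rng.standard_normal((50, d)) + 5.0])

# k-means can now run on q = 2 columns instead of d = 100,
# which is where the per-iteration savings come from
Z = pca_reduce(X, q=2)
```

On this toy example the first principal component alone already separates the two groups, so clustering in the reduced space loses nothing while each distance computation shrinks from d terms to q.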
IMPROVEMENT OF DATA ANALYSIS BASED ON K-MEANS ALGORITHM AND AKMCA
Data analysis is improved using the k-means algorithm and AKMCA. Data mining aims to extract information from a large data set and transform it into a functional structure. Exploratory data analysis and data mining applications rely heavily on clustering: grouping a set of objects so that those in the same group (called a cluster) are more similar to each other than to those in other groups (clusters). There are various types of cluster models, such as connectivity models, distribution models, centroid models, and density models. The algorithm makes use of the density-number concept: a set of high-density points is extracted from the original data set as a new training set, and points in this high-density set are chosen as the initial cluster centres. The basic clustering technique, and the most widely used algorithm, is K-means clustering.
K-Means, a partition-based clustering algorithm, is widely used in many fields due to its efficiency and simplicity: it divides the dataset into a specified number of clusters by repeatedly assigning each point to its nearest centre. However, it is well known that K-Means can produce suboptimal results depending on the initial cluster centres chosen, and numerous efforts have been made to improve its performance. The advanced k-means clustering algorithm (AKMCA) is used in data analysis to obtain useful knowledge for various optimisation and classification problems involving massive amounts of raw and unstructured data. Knowledge discovery provides the tools needed to automate the entire data-analysis and error-reduction process; the efficacy of these methods is investigated through experimental analysis of various datasets. We present a detailed experimental analysis and a comparison of the proposed work with existing k-means clustering algorithms. Furthermore, this work provides a clear and comprehensive understanding of the k-means algorithm and its various research directions.
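The density-number seeding described above can be sketched in a few lines: count how many sample points fall within a radius of each point, then pick the densest points as initial centres while skipping the neighbourhood of centres already chosen. The radius and toy data below are illustrative assumptions, not values from the paper:

```python
import numpy as np

def density_seeds(X, k, radius):
    # pairwise distances; the "density number" of a point is how many
    # sample points fall within the given radius of it
    dist = np.sqrt(((X[:, None, :] - X[None]) ** 2).sum(axis=2))
    density = (dist < radius).sum(axis=1)
    seeds, suppressed = [], np.zeros(len(X), dtype=bool)
    for i in np.argsort(density)[::-1]:          # densest points first
        if not suppressed[i]:
            seeds.append(i)
            suppressed |= dist[i] < radius       # skip neighbours of a chosen seed
            if len(seeds) == k:
                break
    return X[seeds]

# toy data: two tight blobs around (0, 0) and (10, 10)
rng = np.random.default_rng(0)
blobs = np.vstack([rng.normal(0.0, 0.5, (100, 2)),
                   rng.normal(10.0, 0.5, (100, 2))])
seeds = density_seeds(blobs, k=2, radius=3.0)
```

Because each chosen seed suppresses its whole neighbourhood, the two seeds land in different blobs, which is exactly the robustness to bad random starts that such seeding is meant to provide.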
Clustering in high dimension for multivariate and functional data using extreme kurtosis projections
Cluster analysis addresses the existence of clusters in a multivariate sample. It is
performed by algorithms that differ significantly in their notion of what constitutes a
cluster and how to find clusters efficiently. In this thesis we are interested in
large-data problems, and therefore we consider algorithms that use dimension-reduction
techniques to identify interesting structures in large data sets; in particular,
algorithms that use the kurtosis coefficient to detect the clusters present in the data.
The thesis extends the work of Peña and Prieto (2001a), who identified clusters in
multivariate data using univariate projections of the sample data onto the directions
that minimize and maximize the kurtosis coefficient of the projected data, and of
Peña et al. (2010), who used the eigenvalues of a kurtosis matrix to reduce the dimension.
This thesis has two main contributions:
First, we prove that the extreme kurtosis projections have some optimality
properties for mixtures of normal distributions and we propose an algorithm to
identify clusters when the data dimension and the number of clusters present in
the sample are high. The good performance of the algorithm is shown through a
simulation study in which it is compared with the MCLUST, K-means and CLARA
methods.
Second, we propose the extension of multivariate kurtosis for functional data, and we analyze some of its properties for clustering. Additionally, we propose an
algorithm based on kurtosis projections for functional data. Its performance is
compared with the results obtained by Functional Principal Components,
Functional K-means and the FunClust method.
The thesis is structured as follows: Chapter 1 is an introductory Chapter where
we will review some theoretical concepts that will be used throughout the thesis.
In Chapter 2 we review in detail the concept of kurtosis, study its properties,
and give a detailed description of some algorithms proposed in the literature
that use the kurtosis coefficient to detect the clusters present in the data.
In Chapter 3 we study the directions that may be interesting for the detection
of several clusters in the sample and we analyze how the extreme kurtosis directions
are related to these directions. In addition, we present a clustering algorithm for
high-dimensional data using extreme kurtosis directions.
In Chapter 4 we introduce an extension of multivariate kurtosis to functional
data, and we analyze the properties of this measure regarding the
identification of clusters. In addition, we present a clustering algorithm for
functional data using extreme kurtosis directions.
We finish with some remarks and conclusions in the final Chapter.
Programa Oficial de Doctorado en Ingeniería Matemática. President: Ana María Justel Eusebio; Secretary: Andrés Modesto Alonso Fernández; Panel member: José Manuel Mira Mcwilliam.
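The intuition behind minimum-kurtosis directions can be checked on a toy example: for a balanced mixture of two well-separated groups, the projection onto the direction joining them is bimodal and has a kurtosis coefficient well below the Gaussian value of 3, while an orthogonal projection stays near 3. A minimal sketch with made-up data, not the thesis's algorithm:

```python
import numpy as np

def kurtosis_coeff(z):
    # classical kurtosis coefficient of a univariate sample (normal ≈ 3)
    z = (z - z.mean()) / z.std()
    return (z ** 4).mean()

# toy mixture: two groups separated along the first coordinate
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([-4.0, 0.0], 1.0, (200, 2)),
               rng.normal([4.0, 0.0], 1.0, (200, 2))])

k_sep = kurtosis_coeff(X @ np.array([1.0, 0.0]))   # separating direction: bimodal
k_orth = kurtosis_coeff(X @ np.array([0.0, 1.0]))  # orthogonal direction: unimodal
```

Searching over directions for extreme kurtosis values therefore surfaces the separating direction, which is the idea the clustering algorithms in Chapters 3 and 4 build on.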
Evaluation of clustering algorithms for gene expression data
BACKGROUND: Cluster analysis is an integral part of high-dimensional data analysis. In the context of large-scale gene expression data, a filtered set of genes is grouped together according to their expression profiles using one of the numerous clustering algorithms that exist in the statistics and machine learning literature. A closely related problem is that of selecting a clustering algorithm that is "optimal" in some sense from a rather impressive list of clustering algorithms that currently exist. RESULTS: In this paper, we propose two validation measures, each with two parts: one measuring the statistical consistency (stability) of the clusters produced and the other representing their biological functional congruence. Smaller values of these indices indicate better performance for a clustering algorithm. We illustrate this approach using two case studies with publicly available gene expression data sets: one involving SAGE data of breast cancer patients and the other involving time-course cDNA microarray data on yeast. Six well-known clustering algorithms (UPGMA, K-Means, Diana, Fanny, Model-Based and SOM) were evaluated. CONCLUSION: No single clustering algorithm may be best suited for clustering genes into functional groups via expression profiles for all data sets. The validation measures introduced in this paper can aid in the selection of an optimal algorithm, for a given data set, from a collection of available clustering algorithms.
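A standard score for the agreement between two partitions, also used as ARI_HA in the malaria study above, is the Hubert-Arabie Adjusted Rand index: 1 for identical partitions, about 0 for chance-level agreement. The pure-Python sketch below implements the usual contingency-table formula; it is an illustrative ingredient for such stability comparisons, not the validation measures proposed in this paper:

```python
from math import comb
from collections import Counter

def adjusted_rand_index(a, b):
    """Hubert-Arabie Adjusted Rand index between two labelings of the same items."""
    n = len(a)
    pairs = Counter(zip(a, b))              # contingency-table cells n_ij
    rows, cols = Counter(a), Counter(b)     # marginals a_i and b_j
    sum_ij = sum(comb(v, 2) for v in pairs.values())
    sum_a = sum(comb(v, 2) for v in rows.values())
    sum_b = sum(comb(v, 2) for v in cols.values())
    expected = sum_a * sum_b / comb(n, 2)   # chance agreement under the permutation model
    return (sum_ij - expected) / (0.5 * (sum_a + sum_b) - expected)
```

Note that the index is invariant to relabeling: `adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])` scores 1.0, since the two partitions group the items identically.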
Kernel-estimated Nonparametric Overlap-Based Syncytial Clustering
Standard clustering algorithms usually find regular-structured clusters such as ellipsoidally- or spherically-dispersed groups, but are more challenged by groups lacking formal structure or definition. Syncytial clustering is the name that we introduce for methods that merge groups obtained from standard clustering algorithms in order to reveal complex group structure in the data. Here, we develop a distribution-free, fully-automated syncytial clustering algorithm that can be used with k-means and other algorithms. Our approach computes the cumulative distribution function of the normed residuals from an appropriately fit k-groups model and calculates the nonparametric overlap between each pair of groups. Groups with high pairwise overlaps are merged as long as the generalized overlap decreases. Our methodology is always a top performer in identifying groups with regular and irregular structures in several datasets. The approach is also used to identify the distinct kinds of gamma-ray bursts in the Burst and Transient Source Experiment 4Br catalog and the distinct kinds of activation in a functional Magnetic Resonance Imaging study.
Which fMRI clustering gives good brain parcellations?
Analysis and interpretation of neuroimaging data often require one to divide the brain into a number of regions, or parcels, with homogeneous characteristics, be these regions defined in the brain volume or on the cortical surface. While predefined brain atlases do not adapt to the signal in individual subjects' images, parcellation approaches use brain activity (e.g. found in some functional contrasts of interest) and clustering techniques to define regions with some degree of signal homogeneity. In this work, we address the question of which clustering technique is appropriate and how to optimize the corresponding model. We use two principled criteria: goodness of fit (accuracy), and reproducibility of the parcellation across bootstrap samples. We study these criteria on both simulated data and two task-based functional Magnetic Resonance Imaging datasets for the Ward, spectral and K-means clustering algorithms. We show that in general Ward's clustering performs better than alternative methods with regard to reproducibility and accuracy, and that the two criteria diverge regarding the preferred models (reproducibility leading to more conservative solutions), thus deferring the practical decision to a higher-level alternative, namely the choice of a trade-off between accuracy and stability.
Development of a Python package for Functional Data Analysis. Depth measures, applications and clustering
In this paper, the problem of analyzing functional data is addressed. Each observation in functional
data is a function that varies over a continuum. This kind of complex data is increasingly becoming
more common in many research fields. However, Functional Data Analysis (FDA) is a relatively recent
field in which software implementations are basically limited to R. In addition, although these implementations
may follow an open-source scheme, contributing to them can be complicated. The final goal of this
project is to provide a comprehensive Python package for Functional Data Analysis, scikit-fda.
In this undergraduate thesis, the functionality implemented in the package includes functional depth
measures together with their applications and elementary notions of clustering. In a functional space,
establishing an order is complicated by its very nature. Depth measures allow one to define robust statistics
for functional data. The package includes some of the most common: the Fraiman and Muniz depth,
the band depth, and a modification of the latter, the modified band depth. Depth measures
are used in the construction of graphical tools; both the functional boxplot and the magnitude-shape
plot are introduced in the package along with their outlier detection procedures. Furthermore, contributions
in the area of machine learning are made in which basic clustering algorithms are added to the
package: K-means and Fuzzy K-means. Finally, the results of applying these methods to the Canadian
Weather dataset are shown.
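The modified band depth mentioned above can be sketched in a few lines: a curve is deep if, averaged over all pairs of sample curves, it spends a large fraction of the grid inside the band the pair delimits. The simplified numpy version below (bands formed by pairs of curves, J = 2) illustrates the definition on toy data and is not the scikit-fda implementation:

```python
import numpy as np
from itertools import combinations

def modified_band_depth(curves):
    # curves: (n_curves, n_grid) array of functions sampled on a common grid
    n = curves.shape[0]
    depth = np.zeros(n)
    for j, k in combinations(range(n), 2):
        lo = np.minimum(curves[j], curves[k])
        hi = np.maximum(curves[j], curves[k])
        # fraction of grid points where each curve lies inside this pair's band
        depth += ((curves >= lo) & (curves <= hi)).mean(axis=1)
    return depth / (n * (n - 1) / 2)

# toy sample: flat curves at levels 0..4; the middle one should be deepest
sample = np.array([np.full(20, level) for level in [0.0, 1.0, 2.0, 3.0, 4.0]])
depths = modified_band_depth(sample)   # depths: [0.4, 0.7, 0.8, 0.7, 0.4]
```

The deepest curve acts as a robust functional median, which is what makes depth measures useful for the boxplots and outlier-detection tools described above.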
The Python package is published in a GitHub repository. It is open-source, with the aim of growing
and being kept up to date. In the long term, it is expected to cover the fundamental techniques of FDA
and become a widely used toolbox for research in FDA.