clValid: An R Package for Cluster Validation
The R package clValid contains functions for validating the results of a clustering analysis. There are three main types of cluster validation measures available, "internal", "stability", and "biological". The user can choose from nine clustering algorithms in existing R packages, including hierarchical, K-means, self-organizing maps (SOM), and model-based clustering. In addition, we provide a function to perform the self-organizing tree algorithm (SOTA) method of clustering. Any combination of validation measures and clustering methods can be requested in a single function call. This allows the user to simultaneously evaluate several clustering algorithms while varying the number of clusters, to help determine the most appropriate method and number of clusters for the dataset of interest. Additionally, the package can automatically make use of the biological information contained in the Gene Ontology (GO) database to calculate the biological validation measures, via the annotation packages available in Bioconductor. The function returns an object of S4 class "clValid", which has summary, plot, print, and additional methods which allow the user to display the optimal validation scores and extract clustering results.
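The idea of scoring several cluster counts with an internal validation measure can be illustrated with a minimal sketch. It is written in Python rather than R and does not use or reproduce the clValid API: the helper names (`kmeans`, `silhouette_width`), the toy data, and the deterministic initialization are all assumptions made for this example. Silhouette width is used here as one example of an "internal" measure, and the k-means routine is a toy stand-in for a real clustering algorithm.

```python
import math

def kmeans(points, k, iters=50):
    """Toy Lloyd's algorithm with a deterministic start (hypothetical helper)."""
    step = len(points) // k
    centers = [points[i * step] for i in range(k)]  # evenly spaced seeds
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest center (ties go to the lower index).
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append(
                    tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centers.append(centers[c])  # keep an empty cluster's center
        if new_centers == centers:
            break
        centers = new_centers
    return labels

def silhouette_width(points, labels):
    """Average silhouette width; singleton clusters contribute 0."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        return 0.0
    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue
        # a: mean distance to the other members of the point's own cluster.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for m, other in clusters.items() if m != lab)
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(points)

# Three well-separated groups of four points each (made-up data).
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (20, 0), (20, 1), (21, 0), (21, 1)]

# Score each candidate number of clusters and keep the best one.
scores = {k: silhouette_width(points, kmeans(points, k)) for k in (2, 3, 4)}
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

On this toy data the silhouette width is highest for three clusters, matching how the points were constructed. A real analysis would sweep several algorithms as well as several cluster counts, which is the combination the package is described as handling in a single call.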
Kernel spectral clustering of large dimensional data
This article proposes a first analysis of kernel spectral clustering methods
in the regime where the dimension of the data vectors to be clustered and
their number grow large at the same rate. We demonstrate, under a k-class
Gaussian mixture model, that the normalized Laplacian matrix associated with
the kernel matrix asymptotically behaves similarly to a so-called spiked random
matrix. Some of the isolated eigenvalue-eigenvector pairs in this model are
shown to carry the clustering information upon a separability condition
classical in spiked matrix models. We evaluate precisely the position of these
eigenvalues and the content of the eigenvectors, which unveil important
(sometimes quite disruptive) aspects of kernel spectral clustering from both
theoretical and practical standpoints. Our results are then compared to the
actual clustering performance of images from the MNIST database, thereby
revealing an important match between theory and practice.
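For orientation, the objects named in this abstract can be written out as below. The symbols are assumptions introduced here and are not taken from the abstract: $x_1, \dots, x_n \in \mathbb{R}^p$ are the data vectors, $f$ is a kernel function, $1_n$ is the all-ones vector, and $c_0$ is the limiting ratio of dimension to sample size. The normalization shown is one common convention, and the article's own scaling may differ.

```latex
K_{ij} = f\!\left(\frac{\lVert x_i - x_j \rVert^2}{p}\right), \qquad
D = \operatorname{diag}(K 1_n), \qquad
L = D^{-1/2} K D^{-1/2}, \qquad
\frac{p}{n} \to c_0 \in (0, \infty).
```

Under this reading, spectral clustering groups the data using the leading eigenvectors of $L$, and the abstract's claim is that a few isolated eigenvalue-eigenvector pairs of $L$ carry the class information.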
MEMOFinder: combining _de novo_ motif prediction methods with a database of known motifs
*Background:* Methods for finding overrepresented sequence motifs are useful in several key areas of computational biology. They aim at detecting very weak signals responsible for biological processes requiring robust sequence identification like transcription-factor binding to DNA or docking sites in proteins. Currently, general performance of the model-based motif-finding methods is unsatisfactory; however, different methods are successful in different cases. This leads to the practical problem of combining results of different motif-finding tools, taking into account current knowledge collected in motif databases.
*Results:* We propose a new complete service allowing researchers to submit their sequences for analysis by four different motif-finding methods for clustering and comparison with a reference motif database. It is tailored for regulatory motif detection; however, it allows a substantial amount of configuration of the sequence background, the motif database, and the parameters of the motif-finding methods.
*Availability:* The method is available online as a web server at: http://bioputer.mimuw.edu.pl/software/mmf/. In addition, the source code is released under the GNU General Public License.