103 research outputs found
Ward's Hierarchical Clustering Method: Clustering Criterion and Agglomerative Algorithm
The Ward error sum of squares hierarchical clustering method has been very
widely used since its first description by Ward in a 1963 publication. It has
also been generalized in various ways. However there are different
interpretations in the literature and there are different implementations of
the Ward agglomerative algorithm in commonly used software systems, including
differing expressions of the agglomerative criterion. Our survey work and case
studies will be useful for all those involved in developing software for data
analysis using Ward's hierarchical clustering method. Comment: 20 pages, 21 citations, 4 figures
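As a rough illustration of what the "agglomerative criterion" refers to, the sketch below computes the increase in the error sum of squares caused by merging two clusters, once from the standard centroid-based formula |A||B| / (|A| + |B|) · ||c_A − c_B||² and once directly from the definition. The function names and sample points are illustrative assumptions and do not come from the paper or from any particular software package.

```python
def centroid(points):
    """Coordinate-wise mean of a list of equal-length tuples."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def error_sum_of_squares(points):
    """Sum of squared Euclidean distances from each point to the centroid."""
    c = centroid(points)
    return sum(sum((x - m) ** 2 for x, m in zip(p, c)) for p in points)


def ward_merge_cost(a, b):
    """Increase in the error sum of squares if clusters a and b are merged."""
    ca, cb = centroid(a), centroid(b)
    sq_centroid_dist = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * sq_centroid_dist


# Hypothetical sample clusters, chosen only for the demonstration.
a = [(0.0, 0.0), (2.0, 0.0)]
b = [(5.0, 4.0), (7.0, 4.0), (6.0, 7.0)]

# The same quantity computed directly from the definition.
direct = error_sum_of_squares(a + b) - error_sum_of_squares(a) - error_sum_of_squares(b)
print(ward_merge_cost(a, b), direct)
```

Implementations may differ, for example, in whether they work with this squared quantity or with a distance-like transformation of it, so dendrogram heights from different packages are not necessarily comparable without checking which form is used.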
MaxMin Linear Initialization for Fuzzy C-Means
Clustering is an extensive research area in data science. The aim of clustering is to discover groups and to identify interesting patterns in datasets. Crisp (hard) clustering assumes that each data point belongs to one and only one cluster. However, this is inadequate when data points may belong to several clusters, as is the case in text categorization. Thus, we need more flexible clustering. Fuzzy clustering methods, where each data point can belong to several clusters, are an interesting alternative. Yet, seeding iterative fuzzy algorithms to achieve high-quality clustering is an issue. In this paper, we propose a new linear and efficient initialization algorithm, MaxMin Linear, to deal with this problem. Then, we validate our theoretical results through extensive experiments on a variety of numerical real-world and artificial datasets. We also test several validity indices, including a new validity index that we propose, Transformed Standardized Fuzzy Difference (TSFD).
Commonality Preserving Multiple Instance Clustering Based on Diverse Density
Image-set clustering is the problem of decomposing a given image set into disjoint subsets satisfying specified criteria. For single-vector image representations, a proximity or similarity criterion is widely applied, i.e., proximal or similar images form a cluster. The recent trend in image description, however, is based on local features, i.e., an image is described by multiple local features, e.g., SIFT, SURF, and so on. With this description, which criterion should be employed for clustering? As an answer to this question, this paper presents an image-set clustering method based on commonality, that is, images preserving strong commonality (coherent local features) form a cluster. Under this criterion, image variations that do not affect common features are harmless. In the case of face images, hair-style changes and partial occlusions by glasses may not affect the cluster formation. We defined four commonality measures based on Diverse Density that are used in agglomerative clustering. Through comparative experiments, we confirmed that two of our methods perform better than the other methods examined in the experiments.
Linear, Deterministic, and Order-Invariant Initialization Methods for the K-Means Clustering Algorithm
Over the past five decades, k-means has become the clustering algorithm of
choice in many application domains primarily due to its simplicity, time/space
efficiency, and invariance to the ordering of the data points. Unfortunately,
the algorithm's sensitivity to the initial selection of the cluster centers
remains its most serious drawback. Numerous initialization methods have
been proposed to address this drawback. Many of these methods, however, have
time complexity superlinear in the number of data points, which makes them
impractical for large data sets. On the other hand, linear methods are often
random and/or sensitive to the ordering of the data points. These methods are
generally unreliable in that the quality of their results is unpredictable.
Therefore, it is common practice to perform multiple runs of such methods and
take the output of the run that produces the best results. Such a practice,
however, greatly increases the computational requirements of the otherwise
highly efficient k-means algorithm. In this chapter, we investigate the
empirical performance of six linear, deterministic (non-random), and
order-invariant k-means initialization methods on a large and diverse
collection of data sets from the UCI Machine Learning Repository. The results
demonstrate that two relatively unknown hierarchical initialization methods due
to Su and Dy outperform the remaining four methods with respect to two
objective effectiveness criteria. In addition, a recent method due to Erisoglu
et al. performs surprisingly poorly. Comment: 21 pages, 2 figures, 5 tables, Partitional Clustering Algorithms
(Springer, 2014). arXiv admin note: substantial text overlap with
arXiv:1304.7465, arXiv:1209.196
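To make the three properties named in the abstract above concrete (linear in the number of points, deterministic, order-invariant), here is a minimal sketch of a maximin-style seeding. It is an illustrative assumption and is not claimed to be one of the six methods the chapter evaluates. The function name `maximin_seeds`, the choice of first seed (the point farthest from the data mean) and the tie-breaking rule (by coordinate values rather than by position) are all hypothetical choices made for this example.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(p, q))


def maximin_seeds(points, k):
    """Pick k initial centers in O(n * k) time without any randomness.

    Points must be tuples so that ties can be broken by coordinate value,
    which keeps the result independent of the input order (up to
    floating-point rounding when the mean is computed).
    """
    n = len(points)
    dim = len(points[0])
    mean = tuple(sum(p[d] for p in points) / n for d in range(dim))

    # First seed: the point farthest from the overall mean.
    first = max(points, key=lambda p: (sq_dist(p, mean), p))
    seeds = [first]

    # nearest[j] holds the squared distance from point j to its closest seed.
    nearest = [sq_dist(p, first) for p in points]
    while len(seeds) < k:
        # Next seed: the point whose closest seed is farthest away.
        i = max(range(n), key=lambda j: (nearest[j], points[j]))
        seeds.append(points[i])
        nearest = [min(d, sq_dist(p, points[i])) for d, p in zip(nearest, points)]
    return seeds


# Hypothetical toy data with three loose groups.
data = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (10.0, 1.0), (0.0, 9.0), (1.0, 9.0)]
print(maximin_seeds(data, 3))
print(maximin_seeds(list(reversed(data)), 3))  # same seeds for a reordered input
```

Because no random draw is involved, a single run suffices, which avoids the multiple-restart cost the abstract describes for random initializations.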
Genomic distance entrained clustering and regression modelling highlights interacting genomic regions contributing to proliferation in breast cancer
Background: Genomic copy number changes and regional alterations in epigenetic states have been linked to grade in breast cancer. However, the relative contribution of specific alterations to the pathology of different breast cancer subtypes remains unclear. The heterogeneity and interplay of genomic and epigenetic variations means that large datasets and statistical data mining methods are required to uncover recurrent patterns that are likely to be important in cancer progression. Results: We employed ridge regression to model the relationship between regional changes in gene expression and proliferation. Regional features were extracted from tumour gene expression data using a novel clustering method, called genomic distance entrained agglomerative (GDEC) clustering. Using gene expression data in this way provides a simple means of integrating the phenotypic effects of both copy number aberrations and alterations in chromatin state. We show that regional metagenes derived from GDEC clustering are representative of recurrent regions of epigenetic regulation or copy number aberrations in breast cancer. Furthermore, detected patterns of genomic alterations are conserved across independent oestrogen receptor positive breast cancer datasets. Sequential competitive metagene selection was used to reveal the relative importance of genomic regions in predicting proliferation rate. The predictive model suggested additive interactions between the most informative regions such as 8p22-12 and 8q13-22. Conclusions: Data-mining of large-scale microarray gene expression datasets can reveal regional clusters of co-ordinate gene expression, independent of cause. By correlating these clusters with tumour proliferation we have identified a number of genomic regions that act together to promote proliferation in ER+ breast cancer. Identification of such regions should enable prioritisation of genomic regions for combinatorial functional studies to pinpoint the key genes and interactions contributing to tumourigenicity.
Leveraging structure determination with fragment screening for infectious disease drug targets: MECP synthase from Burkholderia pseudomallei
As part of the Seattle Structural Genomics Center for Infectious Disease, we seek to enhance structural genomics with ligand-bound structure data which can serve as a blueprint for structure-based drug design. We have adapted fragment-based screening methods to our structural genomics pipeline to generate multiple ligand-bound structures of high priority drug targets from pathogenic organisms. In this study, we report fragment screening methods and structure determination results for 2C-methyl-D-erythritol-2,4-cyclo-diphosphate (MECP) synthase from Burkholderia pseudomallei, the gram-negative bacterium which causes melioidosis. Screening by nuclear magnetic resonance spectroscopy as well as crystal soaking followed by X-ray diffraction led to the identification of several small molecules which bind this enzyme in a critical metabolic pathway. A series of complex structures obtained with screening hits reveal distinct binding pockets and a range of small molecules which form complexes with the target. Additional soaks with these compounds further demonstrate a subset of fragments to only bind the protein when present in specific combinations. This ensemble of fragment-bound complexes illuminates several characteristics of MECP synthase, including a previously unknown binding surface external to the catalytic active site. These ligand-bound structures now serve to guide medicinal chemists and structural biologists in rational design of novel inhibitors for this enzyme
Structure of a Burkholderia pseudomallei Trimeric Autotransporter Adhesin Head
Pathogenic bacteria adhere to the host cell surface using a family of outer membrane proteins called Trimeric Autotransporter Adhesins (TAAs). Although TAAs are highly divergent in sequence and domain structure, they are all conceptually comprised of a C-terminal membrane anchoring domain and an N-terminal passenger domain. Passenger domains consist of a secretion sequence, a head region that facilitates binding to the host cell surface, and a stalk region. Pathogenic species of Burkholderia contain an overabundance of TAAs, some of which have been shown to elicit an immune response in the host. To understand the structural basis for host cell adhesion, we solved a 1.35 Å resolution crystal structure of a BpaA TAA head domain from Burkholderia pseudomallei, the pathogen that causes melioidosis. The structure reveals a novel fold of an intricately intertwined trimer. The BpaA head is composed of structural elements that have been observed in other TAA head structures as well as several elements of previously unknown structure predicted from low sequence homology between TAAs. These elements are typically up to 40 amino acids long and are not domains, but rather modular structural elements that may be duplicated or omitted through evolution, creating molecular diversity among TAAs. The modular nature of BpaA, as demonstrated by its head domain crystal structure, and of TAAs in general provides insights into evolution of pathogen-host adhesion and may provide an avenue for diagnostics.
SAD phasing using iodide ions in a high-throughput structural genomics environment
The Seattle Structural Genomics Center for Infectious Disease (SSGCID) focuses on the structure elucidation of potential drug targets from class A, B, and C infectious disease organisms. Many SSGCID targets are selected because they have homologs in other organisms that are validated drug targets with known structures. Thus, many SSGCID targets are expected to be solved by molecular replacement (MR), and reflective of this, all proteins are expressed in native form. However, many community request targets do not have homologs with known structures and not all internally selected targets readily solve by MR, necessitating experimental phase determination. We have adopted the use of iodide ion soaks and single wavelength anomalous dispersion (SAD) experiments as our primary method for de novo phasing. This method uses existing native crystals and in house data collection, resulting in rapid, low cost structure determination. Iodide ions are non-toxic and soluble at molar concentrations, facilitating binding at numerous hydrophobic or positively charged sites. We have used this technique across a wide range of crystallization conditions with successful structure determination in 16 of 17 cases within the first year of use (94% success rate). Here we present a general overview of this method as well as several examples including SAD phasing of proteins with novel folds and the combined use of SAD and MR for targets with weak MR solutions. These cases highlight the straightforward and powerful method of iodide ion SAD phasing in a high-throughput structural genomics environment
Determination of genetic structure of germplasm collections: are traditional hierarchical clustering methods appropriate for molecular marker data?
Despite the availability of newer approaches, traditional hierarchical clustering remains very popular in genetic diversity studies in plants. However, little is known about its suitability for molecular marker data. We studied the performance of traditional hierarchical clustering techniques using real and simulated molecular marker data. Our study also compared the performance of traditional hierarchical clustering with model-based clustering (STRUCTURE). We showed that the cophenetic correlation coefficient is directly related to subgroup differentiation and can thus be used as an indicator of the presence of genetically distinct subgroups in germplasm collections. Whereas UPGMA performed well in preserving distances between accessions, Ward excelled in recovering groups. Our results also showed a close similarity between clusters obtained by Ward and by STRUCTURE. Traditional cluster analysis can provide an easy and effective way of determining structure in germplasm collections using molecular marker data, and the output can be used for sampling core collections or for association studies.
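For readers unfamiliar with the cophenetic correlation coefficient mentioned above, the following sketch shows one way it can be computed: run UPGMA (average linkage), record for every pair of items the height at which they are first joined, and correlate those heights with the original pairwise distances. This is a naive, unoptimized illustration using Euclidean distances on made-up points, not the authors' pipeline or marker data, and all function names are hypothetical.

```python
from itertools import combinations
from math import dist, sqrt


def upgma_cophenetic(points):
    """Naive UPGMA; returns {(i, j): merge height} for every pair i < j."""
    clusters = {i: [i] for i in range(len(points))}
    coph = {}

    def avg_dist(a, b):
        # Average pairwise distance between the members of two clusters.
        return sum(dist(points[i], points[j]) for i in a for j in b) / (len(a) * len(b))

    next_id = len(points)
    while len(clusters) > 1:
        # Find the two clusters with the smallest average distance.
        ka, kb = min(combinations(clusters, 2),
                     key=lambda ab: avg_dist(clusters[ab[0]], clusters[ab[1]]))
        height = avg_dist(clusters[ka], clusters[kb])
        # Every cross-cluster pair is first joined at this height.
        for i in clusters[ka]:
            for j in clusters[kb]:
                coph[(min(i, j), max(i, j))] = height
        clusters[next_id] = clusters.pop(ka) + clusters.pop(kb)
        next_id += 1
    return coph


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def cophenetic_correlation(points):
    """Correlation between original distances and UPGMA merge heights."""
    coph = upgma_cophenetic(points)
    pairs = sorted(coph)
    original = [dist(points[i], points[j]) for i, j in pairs]
    tree = [coph[p] for p in pairs]
    return pearson(original, tree)


# Hypothetical data: two clearly separated subgroups.
separated = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(cophenetic_correlation(separated))
```

With two well-separated subgroups, as in this toy input, the tree heights track the original distances closely and the coefficient comes out high, which is consistent with the abstract's suggestion that it can indicate distinct subgroups.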