111,679 research outputs found

    Kernel Spectral Clustering and applications

    In this chapter we review the main literature related to kernel spectral clustering (KSC), an approach to clustering cast within a kernel-based optimization setting. KSC represents a least-squares support vector machine based formulation of spectral clustering described by a weighted kernel PCA objective. Just as in the classifier case, the binary clustering model is expressed by a hyperplane in a high dimensional space induced by a kernel. In addition, the multi-way clustering can be obtained by combining a set of binary decision functions via an Error Correcting Output Codes (ECOC) encoding scheme. Because of its model-based nature, the KSC method encompasses three main steps: training, validation, testing. In the validation stage model selection is performed to obtain tuning parameters, like the number of clusters present in the data. This is a major advantage compared to classical spectral clustering where the determination of the clustering parameters is unclear and relies on heuristics. Once a KSC model is trained on a small subset of the entire data, it is able to generalize well to unseen test points. Beyond the basic formulation, sparse KSC algorithms based on the Incomplete Cholesky Decomposition (ICD) and L_0, L_1, L_0 + L_1, and Group Lasso regularization are reviewed. In that respect, we show how it is possible to handle large scale data. Also, two possible ways to perform hierarchical clustering and a soft clustering method are presented. Finally, real-world applications such as image segmentation, power load time-series clustering, document clustering and big data learning are considered. Comment: chapter contribution to the book "Unsupervised Learning Algorithms"
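
    The binary case described above can be illustrated with a minimal sketch. It assumes the weighted kernel PCA eigenproblem D^-1 M_D Omega alpha = lambda alpha from the KSC literature, an RBF kernel, and plain power iteration in place of a proper eigensolver. The function names (ksc_binary, ksc_assign), the toy points and the bandwidth sigma are illustrative assumptions, not taken from the chapter; model selection, sparsity and the ECOC multi-way encoding are left out.

```python
import math

def rbf(p, q, sigma):
    """Gaussian (RBF) kernel between two coordinate tuples."""
    return math.exp(-sum((a - b) ** 2 for a, b in zip(p, q)) / (2 * sigma ** 2))

def ksc_binary(points, sigma=1.0, iters=200):
    """Binary KSC sketch: leading eigenvector of D^-1 M_D Omega via power iteration."""
    n = len(points)
    omega = [[rbf(p, q, sigma) for q in points] for p in points]
    inv_deg = [1.0 / sum(row) for row in omega]   # diagonal of D^-1
    s = sum(inv_deg)                              # scalar 1^T D^-1 1

    def apply(alpha):
        v = [sum(w * a for w, a in zip(row, alpha)) for row in omega]  # Omega alpha
        bias = -sum(d * x for d, x in zip(inv_deg, v)) / s             # bias term b
        e = [x + bias for x in v]                 # scores e = Omega alpha + b (= M_D Omega alpha)
        return [d * x for d, x in zip(inv_deg, e)], e, bias

    # Deterministic start vector (assumed not orthogonal to the leading eigenvector).
    alpha = [float(i) for i in range(n)]
    for _ in range(iters):
        alpha, _, _ = apply(alpha)
        norm = math.sqrt(sum(a * a for a in alpha))
        alpha = [a / norm for a in alpha]
    _, e, bias = apply(alpha)
    labels = [1 if x > 0 else -1 for x in e]      # the sign of the score encodes the cluster
    return alpha, bias, labels

def ksc_assign(x, points, alpha, bias, sigma=1.0):
    """Out-of-sample extension: sign of the score of an unseen point."""
    score = sum(a * rbf(x, p, sigma) for a, p in zip(alpha, points)) + bias
    return 1 if score > 0 else -1

# Toy data (hypothetical): two well-separated groups of five points each.
points = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (-0.2, 0.1), (0.0, -0.2),
          (5.0, 5.0), (5.2, 4.9), (4.9, 5.3), (5.1, 5.1), (4.8, 4.8)]
alpha, bias, labels = ksc_binary(points, sigma=1.0)
```

    Which group receives +1 and which -1 depends on the arbitrary sign of the eigenvector, so only the induced partition is meaningful. An unseen point is assigned with ksc_assign((0.1, 0.1), points, alpha, bias, 1.0), reusing the alpha and bias learned on the training subset.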

    Spatial clustering of array CGH features in combination with hierarchical multiple testing

    We propose a new approach for clustering DNA features using array CGH data from multiple tumor samples. We distinguish data-collapsing (joining contiguous DNA clones or probes with extremely similar data into regions) from clustering (joining contiguous, correlated regions based on a maximum likelihood principle). The model-based clustering algorithm accounts for the apparent spatial patterns in the data. We evaluate the randomness of the clustering result by a cluster stability score in combination with cross-validation. Moreover, we argue that the clustering really captures spatial genomic dependency by showing that coincidental clustering of independent regions is very unlikely. Using the region and cluster information, we combine testing of these for association with a clinical variable in a hierarchical multiple testing approach. This allows for interpreting the significance of both regions and clusters while controlling the Family-Wise Error Rate simultaneously. We prove that in the context of permutation tests and permutation-invariant clusters it is allowed to perform clustering and testing on the same data set. Our procedures are illustrated on two cancer data sets

    Parallel Hierarchical Affinity Propagation with MapReduce

    The accelerated evolution and explosion of the Internet and social media is generating voluminous quantities of data (on zettabyte scales). Paramount amongst the desires to manipulate and extract actionable intelligence from vast big data volumes is the need for scalable, performance-conscious analytics algorithms. To directly address this need, we propose a novel MapReduce implementation of the exemplar-based clustering algorithm known as Affinity Propagation. Our parallelization strategy extends to the multilevel Hierarchical Affinity Propagation algorithm and enables tiered aggregation of unstructured data with minimal free parameters, in principle requiring only a similarity measure between data points. We detail the linear run-time complexity of our approach, overcoming the limiting quadratic complexity of the original algorithm. Experimental validation of our clustering methodology on a variety of synthetic and real data sets (e.g. images and point data) demonstrates our competitiveness against other state-of-the-art MapReduce clustering techniques

    Weltklimapolitik im Kongobecken : Neue Chance oder Ökorente für die Eliten?

    The objective of this Master's thesis was to develop, implement and evaluate an iterative procedure for hierarchical clustering with good overall performance which also merges features of certain already described algorithms into a single integrated package. An accordingly built tool was then applied to an allergen IgE-reactivity data set. The finally implemented algorithm uses a hierarchical approach which illustrates the emergence of patterns in the data. At each level of the hierarchical tree a partitional clustering method is used to divide data into k groups, where the number k is decided through application of cluster validation techniques. The cross-reactivity analysis, by means of the new algorithm, largely arrives at anticipated cluster formations in the allergen data, which strengthens results obtained in previous studies on the subject. Notably, though, certain unexpected findings presented in the former analysis were aggregated differently, and more in line with phylogenetic and protein family relationships, by the novel clustering package
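
    Choosing k by cluster validation at a single level of the tree can be sketched as follows. The abstract does not say which partitional method or validation index the thesis uses, so this sketch makes two assumptions: a gap-based splitter for 1-D values stands in for the partitional method, and the mean silhouette width stands in for the validation technique. All function names and the toy values are hypothetical.

```python
def gap_partition(values, k):
    """Split sorted 1-D values into k groups at the k-1 widest gaps (stand-in partitioner)."""
    xs = sorted(values)
    widest = sorted(range(len(xs) - 1), key=lambda i: xs[i + 1] - xs[i], reverse=True)
    cuts = sorted(widest[:k - 1])          # a cut at i separates xs[i] from xs[i + 1]
    groups, start = [], 0
    for c in cuts:
        groups.append(xs[start:c + 1])
        start = c + 1
    groups.append(xs[start:])
    return groups

def mean_silhouette(groups):
    """Mean silhouette width of a partition into at least two groups (singletons score 0)."""
    scores = []
    for gi, g in enumerate(groups):
        for idx, x in enumerate(g):
            if len(g) == 1:
                scores.append(0.0)
                continue
            # a: mean distance to the other members of the same group
            a = sum(abs(x - y) for j, y in enumerate(g) if j != idx) / (len(g) - 1)
            # b: mean distance to the nearest other group
            b = min(sum(abs(x - y) for y in h) / len(h)
                    for hi, h in enumerate(groups) if hi != gi)
            scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)

def choose_k(values, candidates):
    """Return the candidate k (each k >= 2) whose partition has the highest mean silhouette."""
    return max(candidates, key=lambda k: mean_silhouette(gap_partition(values, k)))

# Toy data (hypothetical): three visibly separated groups of reactivity-like values.
values = [1.0, 1.2, 0.9, 5.0, 5.1, 4.9, 10.0, 10.2, 9.8]
best_k = choose_k(values, range(2, 6))
```

    In a hierarchical procedure of the kind described, the same selection step would be repeated inside each resulting group until no split is supported by the validation index.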

    Development and validation of clinical profiles of patients hospitalized due to behavioral and psychological symptoms of dementia.

    Patients hospitalized on acute psychogeriatric wards are a heterogeneous population. Cluster analysis is a useful statistical method for partitioning a sample of patients into well separated groups of patients who present common characteristics. Several patient profile studies exist, but they are not adapted to acutely hospitalized psychogeriatric patients with cognitive impairment. The present study aims to partition patients hospitalized due to behavioral and psychological symptoms of dementia into profiles based on a global evaluation of mental health using cluster analysis. Using nine of the 13 items from the Health of the Nation Outcome Scales for elderly people (HoNOS65+), data were collected from a sample of 542 inpatients with dementia who were hospitalized between 2011 and 2014 in acute psychogeriatric wards of a Swiss university hospital. An optimal clustering solution was generated to represent various profiles, by using a mixed approach combining hierarchical and non-hierarchical (k-means) cluster analyses associated with a split-sample cross-validation. The quality of the clustering solution was evaluated based on a cross-validation, on a k-means method with 100 random initial seeds, on validation indexes, and on clinical interpretation. The final solution consisted of four clinically distinct and homogeneous profiles labeled (1) BPSD-affective, (2) BPSD-functional, (3) BPSD-somatic and (4) BPSD-psychotic according to their predominant clinical features. The four profiles differed in cognitive status, length of hospital stay, and legal admission status. In the present study, clustering methods allowed us to identify four profiles with distinctive characteristics. This clustering solution may be developed into a classification system that may allow clinicians to differentiate patient needs in order to promptly identify tailored interventions and promote better allocation of available resources

    Definition of a family of tissue-protective cytokines using functional cluster analysis: a proof-of-concept study

    The discovery of the tissue-protective activities of erythropoietin (EPO) has underlined the importance of some cytokines in tissue-protection, repair, and remodeling. As such activities have been reported for other cytokines, we asked whether we could define a class of tissue-protective cytokines. We therefore explored a novel approach based on functional clustering. In this pilot study, we started by analyzing a small number of cytokines (30). We functionally classified the 30 cytokines according to their interactions by using the bioinformatics tool STRING (Search Tool for the Retrieval of Interacting Genes), followed by hierarchical cluster analysis. The results of this functional clustering were different from those obtained by clustering cytokines simply according to their sequence. We previously reported that the protective activity of EPO in a model of cerebral ischemia was paralleled by an upregulation of synaptic plasticity genes, particularly early growth response 2 (EGR2). To assess the predictivity of functional clustering, we tested some of the cytokines clustering close to EPO (interleukin-11, IL-11; kit ligand, KITLG; leukemia inhibitory factor, LIF; thrombopoietin, THPO) in an in vitro model of human neuronal cells for their ability to induce EGR2. Two of these, LIF and IL-11, induced EGR2 expression. Although these data would need to be extended to a larger number of cytokines and the biological validation should be done using more robust in vivo models, rather than just one cell line, this study shows the feasibility of this approach. This type of functional cluster analysis could be extended to other fields of cytokine research and help design biological experiments

    ENHANCEMENT OF DECISION TREE METHOD BASED ON HIERARCHICAL CLUSTERING AND DISPERSION RATIO

    Decision tree classification is a classification method with a built-in feature selection process. Decision trees that split on information gain are at a disadvantage when the dataset has attributes that are unique for each record and an imbalanced class distribution. The data used for decision tree classification are of two types, numerical and nominal. Numerical data undergo a discretization process to obtain data intervals. The weaknesses of the information gain method can be reduced by using a dispersion ratio method that depends not on the class distribution but on the frequency distribution. Numeric data are discretized using hierarchical clustering to obtain balanced data clusters. The data used in this study were taken from the UCI machine learning repository and contain both numeric and nominal attributes. The research has two stages. First, the numeric data are discretized using hierarchical clustering with three methods: single link, complete link, and average link. Second, the discretization results are merged again, and trees are built with splitting attributes chosen by dispersion ratio and evaluated with 7-fold cross-validation. The results show that discretizing data with hierarchical clustering can improve predictions by 14.6% compared with data without discretization. Attribute splitting with the dispersion ratio on the data discretized by hierarchical clustering can improve predictions by 6.51%
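
    The discretization stage can be sketched for a single numeric attribute as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: agglomerative clustering is written out naively, the three linkage rules are selected by a string, and the function names, the number of bins and the toy values are hypothetical. The dispersion ratio split and the tree construction are not shown.

```python
def linkage_distance(a, b, method):
    """Distance between two clusters of 1-D values under a given linkage rule."""
    dists = [abs(x - y) for x in a for y in b]
    if method == "single":
        return min(dists)                  # closest pair
    if method == "complete":
        return max(dists)                  # farthest pair
    if method == "average":
        return sum(dists) / len(dists)     # mean over all pairs
    raise ValueError(f"unknown linkage: {method}")

def discretize(values, n_bins, method="average"):
    """Agglomeratively cluster 1-D values and return sorted (low, high) intervals."""
    clusters = [[v] for v in sorted(set(values))]
    while len(clusters) > n_bins:
        # Find and merge the closest pair of clusters under the chosen linkage.
        pairs = ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters)))
        i, j = min(pairs, key=lambda ij: linkage_distance(clusters[ij[0]], clusters[ij[1]], method))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return sorted((min(c), max(c)) for c in clusters)

def bin_index(value, intervals):
    """Map a value to the interval that contains it, or to the nearest one otherwise."""
    def gap(interval):
        low, high = interval
        return max(low - value, 0.0, value - high)
    return min(range(len(intervals)), key=lambda k: gap(intervals[k]))

# Toy attribute (hypothetical): three natural groups of numeric values.
values = [1.0, 1.5, 2.0, 10.0, 10.5, 11.0, 30.0, 31.0]
intervals = {m: discretize(values, 3, m) for m in ("single", "complete", "average")}
```

    On this toy attribute all three linkage rules yield the same three intervals; on real data they can disagree, which is presumably why the study compares them. After discretization, bin_index(12.0, intervals["average"]) turns a numeric value into a nominal bin that a decision tree can split on.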

    Are there three main subgroups within the patellofemoral pain population? A detailed characterisation study of 127 patients to help develop targeted Intervention (TIPPs)

    • Background: Current multimodal approaches for the management of non-specific patellofemoral pain are not optimal; however, targeted intervention for subgroups could improve patient outcomes. This study explores whether subgrouping of non-specific patellofemoral pain patients, using a series of low-cost, simple clinical tests, is possible.
    • Method: The exclusivity and clinical importance of potential subgroups was assessed by applying à priori test thresholds (1 SD) from seven clinical tests in a sample of adult patients with non-specific patellofemoral pain. Hierarchical clustering and latent profile analysis were used to gain additional insights into subgroups using data from the same clinical tests.
    • Results: One hundred and thirty participants were recruited, of whom 127 had complete data: 84 (66%) female, mean age 26 years (SD 5.7), mean BMI 25.4 (SD 5.83), and median (IQR) time between onset of pain and assessment 24 (7-60) months. Potential subgroups defined by the à priori test thresholds were not mutually exclusive, and patients frequently fell into multiple subgroups. Using hierarchical clustering and latent profile analysis, three subgroups were identified using 6 of the 7 clinical tests. These subgroups were named (i) 'strong', (ii) 'weak and tighter', and (iii) 'weak and pronated foot'.
    • Conclusions: We conclude that three subgroups of patellofemoral patients may exist based on the results of six clinical tests which are feasible to perform in routine clinical practice. Further research is needed to validate these findings in other datasets and, if supported by external validation, to see if targeted interventions for these subgroups improve patient outcomes