    Performance Evaluation of EM and K-Means Clustering Algorithms in Data Mining System

    In the emerging field of data mining there are several techniques, including clustering, prediction, classification, and association. Clustering divides a data set into groups of related objects such that no two groups have anything in common. Clustering algorithms have emerged as a powerful alternative meta-learning tool for accurately analyzing the massive volume of data generated by modern applications. The main goal is to assign data to clusters such that objects end up in the same cluster when they are related according to particular metrics. Classification, by contrast, organizes data sets into predefined sets using various mathematical models. This research compares the K-Means and Expectation-Maximization algorithms for clustering. Empirically, we focused on extensive experiments in which we compared the best typical algorithm from each of the categories using a large number of real or big data sets. The effectiveness of the Expectation-Maximization clustering algorithm is measured through a number of internal and external validity metrics, stability, runtime and scalability tests.

    Randomized maps for assessing the reliability of patients clusters in DNA microarray data analyses

    Objective: Clustering algorithms may be applied to the analysis of DNA microarray data to identify novel subgroups that may lead to new taxonomies of diseases defined at bio-molecular level. A major problem related to the identification of biologically meaningful clusters is the assessment of their reliability, since clustering algorithms may find clusters even if no structure is present. Methodology: Recently, methods based on random "perturbations" of the data, such as bootstrapping, noise injection techniques and random subspace methods, have been applied to the problem of cluster validity estimation. In this framework, we propose stability measures that exploit the high dimensionality of DNA microarray data and the redundancy of information stored in microarray chips. To this end we randomly project the original gene expression data into lower dimensional subspaces, approximately preserving the distance between the examples according to the Johnson-Lindenstrauss (JL) theory. The stability of the clusters discovered in the original high dimensional space is estimated by comparing them with the clusters discovered in randomly projected lower dimensional subspaces. The proposed cluster-stability measures may be applied to validate and to quantitatively assess the reliability of the clusters obtained by a large class of clustering algorithms. Results and conclusion: We tested the effectiveness of our approach with high dimensional synthetic data, whose distribution is known a priori, showing that the stability measures based on randomized maps correctly predict the number of clusters and the reliability of each individual cluster. Then we showed how to apply the proposed measures to the analysis of DNA microarray data, whose underlying distribution is unknown. We evaluated the validity of clusters discovered by hierarchical clustering algorithms in diffuse large B-cell lymphoma (DLBCL) and malignant melanoma patients, showing that the proposed reliability measures can support bio-medical researchers in the identification of stable clusters of patients and in the discovery of new subtypes of diseases characterized at bio-molecular level.
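
The random projection step described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the Gaussian projection matrix, the dimensions (500 down to 200), the synthetic data, and the names `random_projection` and `distortion_range` are all assumptions chosen for illustration. It only checks that pairwise distances are roughly preserved, which is the property the stability measures rely on; the clustering comparison itself is not shown.

```python
import math
import random


def random_projection(points, target_dim, seed=0):
    """Project points (equal-length lists) onto target_dim dimensions using
    a Gaussian random matrix scaled by 1/sqrt(target_dim)."""
    rng = random.Random(seed)
    source_dim = len(points[0])
    scale = 1.0 / math.sqrt(target_dim)
    matrix = [[rng.gauss(0.0, 1.0) * scale for _ in range(source_dim)]
              for _ in range(target_dim)]
    return [[sum(w * x for w, x in zip(row, p)) for row in matrix]
            for p in points]


def distance(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def distortion_range(original, projected):
    """Smallest and largest ratio of projected to original pairwise distance."""
    ratios = []
    for i in range(len(original)):
        for j in range(i + 1, len(original)):
            ratios.append(distance(projected[i], projected[j])
                          / distance(original[i], original[j]))
    return min(ratios), max(ratios)


# Hypothetical data: 20 synthetic "examples" with 500 features each.
data_rng = random.Random(1)
data = [[data_rng.gauss(0.0, 1.0) for _ in range(500)] for _ in range(20)]

projected = random_projection(data, target_dim=200)
min_ratio, max_ratio = distortion_range(data, projected)
print(f"distance ratios lie between {min_ratio:.2f} and {max_ratio:.2f}")
```

Ratios close to 1 indicate that the projected space keeps the geometry a distance-based clustering algorithm would see, so clusterings obtained before and after projection can be meaningfully compared.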

    Cluster validity in clustering methods

    A proposal of a methodological framework with experimental guidelines to investigate clustering stability on financial time series

    We present in this paper an empirical framework motivated by the practitioner's point of view on stability. The goal is both to assess clustering validity and to yield market insights: the data perturbations we propose provide a multi-view of the assets' clustering behaviour. The perturbation framework is illustrated on an extensive credit default swap time series database available online at www.datagrapple.com. Comment: Accepted at ICMLA 201

    Typical Phone Use Habits: Intense Use Does Not Predict Negative Well-Being

    Not all smartphone owners use their device in the same way. In this work, we uncover broad, latent patterns of mobile phone use behavior. We conducted a study where, via a dedicated logging app, we collected daily mobile phone activity data from a sample of 340 participants for a period of four weeks. Through an unsupervised learning approach and a methodologically rigorous analysis, we reveal five generic phone use profiles which describe at least 10% of the participants each: limited use, business use, power use, and personality-induced and externally induced problematic use. We provide evidence that intense mobile phone use alone does not predict negative well-being. Instead, our approach automatically revealed two groups with tendencies for lower well-being, which are characterized by nightly phone use sessions. Comment: 10 pages, 6 figures, conference paper

    Comparing clusterings and numbers of clusters by aggregation of calibrated clustering validity indexes

    A key issue in cluster analysis is the choice of an appropriate clustering method and the determination of the best number of clusters. Different clusterings are optimal on the same data set according to different criteria, and the choice of such criteria depends on the context and aim of clustering. Therefore, researchers need to consider what data analytic characteristics the clusters they are aiming at are supposed to have, among others within-cluster homogeneity, between-clusters separation, and stability. Here, a set of internal clustering validity indexes measuring different aspects of clustering quality is proposed, including some indexes from the literature. Users can choose the indexes that are relevant in the application at hand. In order to measure the overall quality of a clustering (for comparing clusterings from different methods and/or different numbers of clusters), the index values are calibrated for aggregation. Calibration is relative to a set of random clusterings on the same data. Two specific aggregated indexes are proposed and compared with existing indexes on simulated and real data. Comment: 42 pages, 11 figures
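
One plausible reading of "calibration relative to random clusterings" is to standardize an index value against the mean and standard deviation of the same index computed on random clusterings of the same data. The sketch below follows that reading only. The index (`within_cluster_index`), the uniform random label assignment, the toy one-dimensional data, and the function names are assumptions for illustration and are not taken from the paper.

```python
import random
import statistics


def within_cluster_index(points, labels):
    """Negative mean distance between points sharing a cluster,
    so that higher values mean more homogeneous clusters."""
    total, count = 0.0, 0
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if labels[i] == labels[j]:
                total += abs(points[i] - points[j])
                count += 1
    return -total / count if count else 0.0


def calibrated(index_fn, points, labels, n_random=200, seed=0):
    """Standardize an index value against random clusterings that use
    the same number of clusters on the same data (assumed scheme)."""
    rng = random.Random(seed)
    k = len(set(labels))
    reference = [
        index_fn(points, [rng.randrange(k) for _ in points])
        for _ in range(n_random)
    ]
    mean = statistics.mean(reference)
    spread = statistics.stdev(reference)
    return (index_fn(points, labels) - mean) / spread


# Hypothetical one-dimensional data with two obvious groups.
points = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
good_labels = [0, 0, 0, 1, 1, 1]
bad_labels = [0, 1, 0, 1, 0, 1]

good_score = calibrated(within_cluster_index, points, good_labels)
bad_score = calibrated(within_cluster_index, points, bad_labels)
print(good_score, bad_score)
```

Because every index is expressed in the same units after standardization (deviations from what random clusterings achieve), several calibrated indexes could then be averaged into a single aggregate score.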

    Design and Analysis of SD_DWCA - A Mobility based clustering of Homogeneous MANETs

    This paper deals with the design and analysis of the distributed weighted clustering algorithm SD_DWCA proposed for homogeneous mobile ad hoc networks. It is a connectivity, mobility and energy based clustering algorithm which is suitable for scalable ad hoc networks. The algorithm uses a new graph parameter called strong degree, defined based on the quality of the neighbours of a node. The parameters are chosen so as to ensure high connectivity, cluster stability and energy efficient communication among highly dynamic nodes. This paper also includes the experimental results of the algorithm implemented using the network simulator NS2. The experimental results show that the algorithm is suitable for high speed networks and generates stable clusters with less maintenance overhead.