
    Analysis of Mass Based and Density Based Clustering Techniques on Numerical Datasets

    Clustering is a technique adopted by data mining tools across a range of applications. It provides several algorithms that can assess large data sets based on specific parameters and group related points. This paper gives a comparative analysis of density-based and mass-based clustering algorithms. DBSCAN [15] is the base algorithm for density-based clustering techniques. One advantage of these techniques is that they do not require the number of clusters to be given a priori, and they can detect clusters of different shapes and sizes in large amounts of data containing noise and outliers. OPTICS [14], on the other hand, does not produce a clustering of a data set explicitly; instead, it creates an augmented ordering of the database that represents its density-based clustering structure. Mass-based clustering [22] uses mass estimation as an alternative to density estimation, and likewise takes core regions and noise points as parameters. We analyze the algorithms in terms of the parameters essential for creating meaningful clusters. All the algorithms are tested on numerical data sets of both low and high dimensionality. Keywords: Mass Based (DEMassDBSCAN), DBSCAN, OPTICS
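
    The parameter roles described above can be seen with off-the-shelf implementations. Below is a minimal scikit-learn sketch, not the paper's own code: the half-moons data set and the eps/min_samples values are illustrative choices. It runs DBSCAN and OPTICS on a small non-convex numerical data set; neither is told the number of clusters, and DBSCAN reports noise with the label -1.

        from sklearn.cluster import DBSCAN, OPTICS
        from sklearn.datasets import make_moons

        # Two interleaving half-moons: non-convex clusters with a little noise,
        # the kind of shape centroid-based methods handle poorly.
        X, _ = make_moons(n_samples=500, noise=0.08, random_state=0)

        # DBSCAN needs a neighbourhood radius (eps) and a core-point threshold
        # (min_samples), but not the number of clusters; label -1 marks noise.
        db_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

        # OPTICS builds a reachability ordering of the data rather than one
        # fixed clustering; flat cluster labels are then extracted from it.
        optics = OPTICS(min_samples=5).fit(X)

        print("DBSCAN clusters:", set(db_labels) - {-1},
              "| noise points:", (db_labels == -1).sum())
        print("OPTICS clusters:", set(optics.labels_) - {-1})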

    Unsupervised learning in high-dimensional space

    Thesis (Ph.D.)--Boston University. In machine learning, the problem of unsupervised learning is that of trying to explain key features and find hidden structures in unlabeled data. In this thesis we focus on three unsupervised learning scenarios: graph-based clustering with imbalanced data, point-wise anomaly detection, and anomalous cluster detection on graphs. In the first part we study spectral clustering, a popular graph-based clustering technique. We investigate why spectral clustering performs badly on imbalanced and proximal data. We then propose the partition constrained minimum cut (PCut) framework, based on a novel parametric graph construction method, which is shown to adapt to different degrees of imbalance in the data. We analyze the limit cut behavior of our approach, and demonstrate the significant performance improvement through clustering and semi-supervised learning experiments on imbalanced data. [TRUNCATED]
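
    The PCut framework itself is not reproduced here, but the baseline behaviour it addresses is easy to demonstrate. A minimal sketch, assuming a 10:1 two-blob imbalance and an illustrative RBF affinity, of standard spectral clustering on imbalanced, proximal data:

        import numpy as np
        from sklearn.cluster import SpectralClustering
        from sklearn.datasets import make_blobs

        # Two Gaussian blobs with a 10:1 size imbalance, placed close together.
        X, y = make_blobs(n_samples=[500, 50], centers=[[0, 0], [3, 0]],
                          cluster_std=[1.0, 0.4], random_state=0)

        # Standard spectral clustering with an RBF affinity; on imbalanced,
        # proximal data its balanced-cut objective tends to split the large
        # cluster instead of separating out the small one.
        labels = SpectralClustering(n_clusters=2, affinity="rbf", gamma=1.0,
                                    random_state=0).fit_predict(X)
        print("cluster sizes found:", np.bincount(labels))  # often far from (500, 50)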

    Concentration of measure, negative association, and machine learning

    In this thesis we consider concentration inequalities and the concentration of measure phenomenon from a variety of angles. Sharp tail bounds on the deviation of Lipschitz functions of independent random variables about their mean are well known. We consider variations on this theme for dependent variables on the Boolean cube. In recent years negatively associated probability distributions have been studied as potential generalizations of independent random variables. Results on this class of distributions have been sparse at best, even when restricting to the Boolean cube. We consider the class of negatively associated distributions topologically, as a subset of the general class of probability measures. Both the weak (distributional) topology and the total variation topology are considered, and the simpler notion of negative correlation is investigated. The concentration of measure phenomenon began with Milman's proof of Dvoretzky's theorem, and is therefore intimately connected to the field of high-dimensional convex geometry. Recently this field has found application in the area of compressed sensing. We consider these applications and in particular analyze the use of Gordon's min-max inequality in various compressed sensing frameworks, including the Dantzig selector and the matrix uncertainty selector. Finally we consider the use of concentration inequalities in developing a theoretically sound anomaly detection algorithm. Our method uses a ranking procedure based on KNN graphs of the given data. We develop a max-margin learning-to-rank framework to train limited-complexity models to imitate these KNN scores. The resulting anomaly detector is shown to be asymptotically optimal in that for any false alarm rate α, its decision region converges to the α-percentile minimum volume level set of the unknown underlying density.
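
    The max-margin learning-to-rank stage is beyond a short example, but the KNN scores it is trained to imitate are simple to compute. A minimal sketch, assuming mean distance to the k nearest neighbours as the anomaly score and a sample quantile as the threshold for a target false alarm rate alpha (both illustrative simplifications of the thesis's method):

        import numpy as np
        from sklearn.neighbors import NearestNeighbors

        def knn_anomaly_scores(X, k=10):
            """Score each point by its mean distance to its k nearest
            neighbours; larger scores rank a point as more anomalous."""
            nn = NearestNeighbors(n_neighbors=k + 1).fit(X)  # +1: a point is its own neighbour
            dists, _ = nn.kneighbors(X)
            return dists[:, 1:].mean(axis=1)                 # drop the self-distance column

        rng = np.random.default_rng(0)
        X = np.vstack([rng.normal(0, 1, (300, 2)), [[6.0, 6.0]]])  # one planted outlier
        scores = knn_anomaly_scores(X)
        alpha = 0.01                                   # target false alarm rate
        threshold = np.quantile(scores, 1 - alpha)     # flag the top alpha fraction
        print("flagged points:", np.where(scores > threshold)[0])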

    Mass estimation and its applications

    This paper introduces mass estimation, a base modelling mechanism in data mining. It provides the theoretical basis of mass and an efficient method to estimate mass. We show that it solves problems very effectively in tasks such as information retrieval, regression and anomaly detection. The models that use mass in these three tasks perform at least as well as, and often better than, a total of eight state-of-the-art methods in terms of task-specific performance measures. In addition, mass estimation has constant time and space complexities.
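
    The paper's estimator is more general, but the core idea can be shown in one dimension. A minimal sketch, assuming a simplified level-1 scheme in which the mass of a point is the expected size of the half of the data that lies on its side of a uniformly drawn split:

        import numpy as np

        def mass_scores(data, n_splits=1000, seed=0):
            """Approximate one-dimensional (level-1) mass for each point.

            For each random split point s, every point is credited with the
            size of the half of the data it falls in; its mass is the average
            of these counts over many splits. Central points accumulate high
            mass, fringe points (potential anomalies) accumulate low mass.
            """
            rng = np.random.default_rng(seed)
            data = np.asarray(data, dtype=float)
            scores = np.zeros_like(data)
            for s in rng.uniform(data.min(), data.max(), size=n_splits):
                left = data <= s
                n_left = left.sum()
                scores += np.where(left, n_left, data.size - n_left)
            return scores / n_splits

        x = np.concatenate([np.random.default_rng(1).normal(0, 1, 200), [8.0]])
        m = mass_scores(x)
        print("lowest-mass point:", x[np.argmin(m)])  # likely the planted outlier at 8.0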