    Methods for fast and reliable clustering

    A Fast Clustering Algorithm based on pruning unnecessary distance computations in DBSCAN for High-Dimensional Data

    Clustering is an important technique for dealing with large-scale data, which are created explosively on the internet. Most such data are high-dimensional and noisy, which poses great challenges to retrieval, classification and understanding. No existing approach is “optimal” for large-scale data. For example, DBSCAN requires O(n²) time, Fast-DBSCAN only works well in 2 dimensions, and ρ-Approximate DBSCAN runs in O(n) expected time, which requires the dimension D to be a relatively small constant for the linear running time to hold. However, we prove theoretically and experimentally that ρ-Approximate DBSCAN degenerates to an O(n²) algorithm in very high dimensions, where 2^D ≫ n. In this paper, we propose a novel local neighborhood searching technique and apply it to improve DBSCAN, yielding NQ-DBSCAN, such that a large number of unnecessary distance computations can be effectively avoided. Theoretical analysis and experimental results show that NQ-DBSCAN runs in O(n log n) on average with the help of an indexing technique, and in O(n) in the best case if proper parameters are used, which makes it suitable for many real-time data applications.
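
    A minimal Python sketch of the underlying cost issue the abstract describes (this is not the paper's NQ-DBSCAN; it only contrasts a brute-force ε-neighborhood query with an index-based one to show how an indexing structure prunes distance computations). The data, ε, and query point are illustrative assumptions.

```python
# Sketch: brute-force vs. KD-tree epsilon-neighborhood query.
# Not NQ-DBSCAN; just the range-query step that dominates DBSCAN's cost.
import numpy as np
from sklearn.neighbors import KDTree

rng = np.random.default_rng(0)
X = rng.random((10_000, 3))      # hypothetical low-dimensional data
eps = 0.05
q = X[0]

# Brute force: n distance computations per query -> O(n^2) over all points.
brute = np.where(np.linalg.norm(X - q, axis=1) <= eps)[0]

# Index-based: the tree prunes whole subtrees whose bounding boxes cannot
# contain points within eps, so far fewer distances are actually evaluated.
tree = KDTree(X)
indexed = tree.query_radius(q.reshape(1, -1), r=eps)[0]

assert set(brute) == set(indexed)   # same neighborhood, fewer computations
```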

    A Review of Clustering Algorithms for Clustering Uncertain Data

    Clustering is an important task in data mining. Clustering uncertain data is challenging both in modeling the similarity between uncertain objects and in developing efficient computational methods. Most previous methods for clustering uncertain data extend partitioning or density-based clustering methods, which rely on the geometric distance between two objects. Such methods cannot handle uncertain objects that are indistinguishable by their geometric properties, and the distribution of each object itself is not considered. The probability distribution is an important characteristic that is ignored when measuring the similarity between two uncertain objects in this way. The well-known Kullback-Leibler divergence can be used to measure the similarity between two uncertain objects. The goal of this paper is to provide a detailed review of clustering uncertain data using different methods and to show the effectiveness of each algorithm.
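
    A minimal sketch of the Kullback-Leibler divergence used as a dissimilarity between two uncertain objects represented by discrete probability distributions (illustrative only; the surveyed methods may estimate and compare distributions differently, and the example distributions are hypothetical).

```python
# Sketch: KL divergence as a distribution-aware dissimilarity measure.
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """D_KL(P || Q) for discrete distributions p and q over the same domain."""
    p = np.asarray(p, dtype=float) + eps   # smoothing avoids log(0) / division by 0
    q = np.asarray(q, dtype=float) + eps
    p /= p.sum()
    q /= q.sum()
    return float(np.sum(p * np.log(p / q)))

# Two hypothetical uncertain objects with the same support but different spreads.
p = [0.10, 0.20, 0.40, 0.20, 0.10]
q = [0.05, 0.15, 0.60, 0.15, 0.05]
print(kl_divergence(p, q), kl_divergence(q, p))   # note the asymmetry
```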

    Identification and characterization of irregular consumptions of load data

    Historical information about loadings on a substation helps in evaluating the size of photovoltaic (PV) generation and energy storage for peak shaving and distribution system upgrade deferral. A method based on consumption data is proposed to separate unusual consumption and to form clusters of similar regular consumption. The method optimally partitions the load pattern data into core points and border points, corresponding to high- and low-density regions, respectively. The local outlier factor, which requires neither a fixed probability distribution of the data nor statistical measures, ranks the unusual consumptions using only the border points, which make up a few percent of the complete data. The suggested method finds the optimal, or close to optimal, number of clusters of similarly shaped load patterns to detect regular peak and valley load demands on different days. Furthermore, identification and characterization of features pertaining to unusual consumptions in the load pattern data are carried out on the border points only. The effectiveness of the proposed method and characterization is tested on two practical distribution systems.
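
    A minimal sketch of the local outlier factor step only (not the paper's full pipeline of core/border partitioning): LOF scores unusual daily load profiles without assuming any probability distribution. The synthetic 24-hour profiles and parameters below are hypothetical.

```python
# Sketch: ranking unusual daily consumption profiles with LOF.
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(1)
hours = np.arange(24)
regular = 1.0 + 0.5 * np.sin(2 * np.pi * (hours - 6) / 24)    # typical daily shape
profiles = regular + 0.05 * rng.standard_normal((200, 24))     # 200 regular days
profiles[:5] += rng.uniform(0.5, 1.0, size=(5, 24))            # a few unusual days

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(profiles)          # -1 marks unusual consumptions
scores = -lof.negative_outlier_factor_      # higher score = more unusual
print(np.where(labels == -1)[0], scores[:5].round(2))
```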

    An Improved Model of Virtual Classroom using Information Fusion and NS-DBSCAN

    The virtual classroom is a recent concept of learning platform. It provides an environment, built on internet technology, where teachers, students, researchers and interested people can interact, collaborate, communicate and explain their thoughts and views in a well-organized, technical and pedagogical way. In the present global context, virtual classrooms are a popular technology. Well-renowned e-learning platforms include Blackboard, Schoology, Moodle (Modular Object-Oriented Dynamic Learning Environment), Canvas and Google Classroom. In this thesis, we propose an efficient model of a virtual classroom to enhance the facilities of current e-learning systems. To develop the model, the thesis integrates cloud computing with an information fusion (IF) technique to provide ubiquitous learning capacity on an e-learning platform. In our proposed model, the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm is used to separate different layers of data, reducing time complexity and enhancing data security. We also demonstrate the complete architecture of the cloud-based e-learning process through our proposed virtual classroom.
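
    A minimal sketch of the DBSCAN step mentioned in the abstract (the cloud and information-fusion architecture is not reproduced here): density-based grouping of records, with points in sparse regions flagged as noise. The feature vectors and parameters are illustrative assumptions.

```python
# Sketch: DBSCAN grouping with noise flagged as label -1.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(2)
# Hypothetical 2-D feature vectors standing in for e-learning data records.
records = np.vstack([rng.normal(0.0, 0.1, (50, 2)),
                     rng.normal(1.0, 0.1, (50, 2)),
                     rng.uniform(-1.0, 2.0, (5, 2))])   # scattered noise

labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(records)
print(sorted(set(labels)))   # e.g. [-1, 0, 1]: two dense groups plus noise
```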

    A Short Survey on Data Clustering Algorithms

    With rapidly increasing data, clustering algorithms are important tools for data analytics in modern research. They have been successfully applied to a wide range of domains, for instance bioinformatics, speech recognition, and financial analysis. Formally speaking, given a set of data instances, a clustering algorithm is expected to divide them into subsets that maximize intra-subset similarity and inter-subset dissimilarity, where a similarity measure is defined beforehand. In this work, state-of-the-art clustering algorithms are reviewed from design concept to methodology; different clustering paradigms are discussed, as are advanced clustering algorithms. After that, the existing clustering evaluation metrics are reviewed. A summary with future insights is provided at the end.
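
    A minimal sketch of one common internal evaluation metric of the kind such surveys review: the silhouette score, which rewards high intra-cluster similarity and high inter-cluster dissimilarity. The data, the choice of k-means, and the candidate cluster counts are illustrative assumptions, not taken from the survey.

```python
# Sketch: comparing cluster counts with the silhouette score.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(3)
# Three hypothetical well-separated groups in 2-D.
X = np.vstack([rng.normal(loc, 0.2, (100, 2)) for loc in (0.0, 2.0, 4.0)])

for k in (2, 3, 4):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))   # highest score expected near k=3
```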