380,234 research outputs found

    SOTXTSTREAM: Density-based self-organizing clustering of text streams

    Get PDF
    A streaming data clustering algorithm is presented building upon the density-based selforganizing stream clustering algorithm SOSTREAM. Many density-based clustering algorithms are limited by their inability to identify clusters with heterogeneous density. SOSTREAM addresses this limitation through the use of local (nearest neighbor-based) density determinations. Additionally, many stream clustering algorithms use a two-phase clustering approach. In the first phase, a micro-clustering solution is maintained online, while in the second phase, the micro-clustering solution is clustered offline to produce a macro solution. By performing self-organization techniques on micro-clusters in the online phase, SOSTREAM is able to maintain a macro clustering solution in a single phase. Leveraging concepts from SOSTREAM, a new density-based self-organizing text stream clustering algorithm, SOTXTSTREAM, is presented that addresses several shortcomings of SOSTREAM. Gains in clustering performance of this new algorithm are demonstrated on several real-world text stream datasets

    ANALISA PERBANDINGAN CLUSTERING-BASED, DISTANCE-BASED DAN DENSITY-BASED DALAM MENDETEKSI OUTLIER ANALYSIS OF CLUSTERING-BASED, DISTANCE-BASED AND DENSITY-BASED IN OUTLIER DETECTION

    Get PDF
    ABSTRAKSI: Data Mining adalah proses pencarian pola-pola dan kecenderungan yang menarik dari dalam basis data berukuran besar. Sebuah outlier didefinisikan sebagai sebuah titik data pada suatu data set dimana sangat berbeda dibandingkan dengan titik data pada data set pada umumnya dengan suatu ukuran tertentu. Outlier ini walaupun mempunyai kelakuan yang abnormal, seringkali mengandung informasi yang sangat berguna. Permasalahan deteksi outlier ini mempunyai peran yang sangat penting pada aplikasi deteksi kecurangan, analisis kekuatan jaringan dan deteksi intrusi.Pencarian outlier biasanya dengan konsep keterdekatan berdasarkan hubungannya dengan sisa data yang ada. Pada data berdimensi tinggi, kepadatan data akan semakin berkurang, akibatnya dugaan akan keterdekatan antar data menjadi gagal.Pada tugas akhir ini akan dilakukan perbandingan metode dalam pencarian suatu outlier dalam data berdimensi tinggi. Metode yang akan dibandingkan yaitu: Clustering-based, Density-based, dan Distance-based. Dimana masing-masing metode telah mendukung data berdimensi tinggi.Kata Kunci : data mining, outlier, deteksi outlier, metode deteksi outlier.ABSTRACT: Data mining is interesting patterns and trend finding process in large database. Outlier defined as a data point in database where is different than data point from common database with fixed size. Even outlier have an abnormal behaviour, often contain important information. Outlier detection have important role in fraud detection, intrusion detection, and network monitoring application.Finding an outlier usually using proximity based on existing remain data. In high dimensional data, data become spare, finally proximity notion data become failed.In this final assignment, will doing methods comparison finding outlier in high dimensional data. Existing methods which will be use is Clustering-based, Density-based, and Distance-based. Where each methods support on high dimensional data.Keyword: data mining,outlier, outlier detection, outlier detection method

    Cluster Evaluation of Density Based Subspace Clustering

    Full text link
    Clustering real world data often faced with curse of dimensionality, where real world data often consist of many dimensions. Multidimensional data clustering evaluation can be done through a density-based approach. Density approaches based on the paradigm introduced by DBSCAN clustering. In this approach, density of each object neighbours with MinPoints will be calculated. Cluster change will occur in accordance with changes in density of each object neighbours. The neighbours of each object typically determined using a distance function, for example the Euclidean distance. In this paper SUBCLU, FIRES and INSCY methods will be applied to clustering 6x1595 dimension synthetic datasets. IO Entropy, F1 Measure, coverage, accurate and time consumption used as evaluation performance parameters. Evaluation results showed SUBCLU method requires considerable time to process subspace clustering; however, its value coverage is better. Meanwhile INSCY method is better for accuracy comparing with two other methods, although consequence time calculation was longer.Comment: 6 pages, 15 figure

    Fast multi-image matching via density-based clustering

    Full text link
    We consider the problem of finding consistent matches across multiple images. Previous state-of-the-art solutions use constraints on cycles of matches together with convex optimization, leading to computationally intensive iterative algorithms. In this paper, we propose a clustering-based formulation. We first rigorously show its equivalence with the previous one, and then propose QuickMatch, a novel algorithm that identifies multi-image matches from a density function in feature space. We use the density to order the points in a tree, and then extract the matches by breaking this tree using feature distances and measures of distinctiveness. Our algorithm outperforms previous state-of-the-art methods (such as MatchALS) in accuracy, and it is significantly faster (up to 62 times faster on some bechmarks), and can scale to large datasets (with more than twenty thousands features).Accepted manuscriptSupporting documentatio
    • …
    corecore