59,127 research outputs found

Combining Cluster Validation Indices for Detecting Label Noise

Author: Angelova Milena
Boeva Veselka
Kohstall Jan
Lundberg Lars
Publication date: 15/07/2020

In this paper, we show that cluster validation indices can be used for filtering mislabeled instances or class outliers prior to training in supervised learning problems. We propose a technique, entitled Cluster Validation Index (CVI)-based Outlier Filtering, in which mislabeled instances are identified and eliminated from the training set, and a classification hypothesis is then built from the set of remaining instances. The proposed approach assigns each instance several cluster validation scores representing its potential of being an outlier with respect to the clustering properties assessed by the validation measures used. We examine CVI-based Outlier Filtering and compare it against the Local Outlier Factor (LOF) detection method on ten data sets from the UCI data repository using five well-known learning algorithms and three different cluster validation indices. In addition, we study and compare three different approaches for combining the selected cluster validation measures. Our results show that for most learning algorithms and data sets, the proposed CVI-based outlier filtering algorithm outperforms the baseline method (LOF). The greatest increase in classification accuracy is achieved by combining the cluster validation indices with the union or rank-based median strategy and filtering mislabeled instances globally.

KITopen
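
As a rough illustration of the filtering idea in the abstract above, the sketch below scores each training instance with a single cluster validation index and drops the instances that score poorly. It is not the authors' method: the choice of the silhouette index, the use of class labels as clusters, the threshold of 0, and the function names `silhouette_per_instance` and `filter_suspected_noise` are all assumptions made for this example, and the combination of several indices described in the abstract is not covered.

```python
import math


def silhouette_per_instance(points, labels):
    """Silhouette score of every instance, treating class labels as clusters."""
    scores = []
    for i, (p, lab) in enumerate(zip(points, labels)):
        # Group distances from instance i to all other instances by their label.
        dist_by_label = {}
        for j, (q, other) in enumerate(zip(points, labels)):
            if i != j:
                dist_by_label.setdefault(other, []).append(math.dist(p, q))
        own = dist_by_label.pop(lab, [])
        if not own or not dist_by_label:
            scores.append(0.0)  # singleton class or single-class data: use 0
            continue
        a = sum(own) / len(own)  # mean distance to the instance's own class
        b = min(sum(d) / len(d) for d in dist_by_label.values())  # nearest other class
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores


def filter_suspected_noise(points, labels, threshold=0.0):
    """Return indices of instances kept for training, plus all scores.

    Assumption for this sketch: an instance whose silhouette falls below
    `threshold` sits closer to another class than to its own and is
    treated as possibly mislabeled.
    """
    scores = silhouette_per_instance(points, labels)
    kept = [i for i, s in enumerate(scores) if s >= threshold]
    return kept, scores


# Toy data: two well-separated classes and one point carrying the wrong label.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (0.5, 0.5)]
labels = [0, 0, 0, 0, 1, 1, 1, 1, 1]  # the last label is deliberately wrong
kept, scores = filter_suspected_noise(points, labels)
```

On this toy data only the last instance receives a negative score, so `kept` holds the first eight indices. A classifier would then be trained on the kept instances only.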

Knowledge Reused Outlier Detection

Author: Ding Zhengming
Hu Chunming
Liu Hongfu
Yu Weiren
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/03/2019

Tremendous effort has been invested in unsupervised outlier detection research, which is conducted on unlabeled data sets under abnormality assumptions. With abundant related labeled data available as auxiliary information, we consider transferring the knowledge from the labeled source data to facilitate unsupervised outlier detection on the target data set. To make full use of the source knowledge, the source data and target data are put together for joint clustering and outlier detection, using the source data cluster structure as a constraint. To achieve this, the categorical utility function is employed to regularize the partitions of the target data to be consistent with the source data labels. With an augmented matrix, the problem is completely solved by a K-means-based method with a rigorous mathematical formulation and a theoretical convergence guarantee. We have used four real-world data sets and eight outlier detection methods of different kinds for extensive experiments and comparison. The results demonstrate the effectiveness and significant improvements of the proposed methods in terms of outlier detection and cluster validity metrics. Moreover, a parameter analysis is provided as a practical guide, and a noisy source label analysis shows that the proposed method can handle real applications where source labels can be noisy.

IUPUIScholarWorks

Outlier Detection Method on UCI Repository Dataset by Entropy Based Rough K-means

Author: Nawaz G.M Kadhar
P. Ashok
Publication venue: 'Defence Scientific Information and Documentation Centre'
Publication date: 23/03/2016

Rough set theory handles uncertainty and incomplete information by using two sets, the lower and upper approximations. In this paper, the clustering process is improved by adapting the preliminary centroid selection method of the rough K-means (RKM) algorithm. The entropy based rough K-means (ERKM) method is developed by applying entropy based preliminary centroid selection to RKM, and it is executed and validated with cluster validity indexes. An example shows that ERKM performs effectively when the preliminary centroids are selected by entropy. In addition, outlier detection is an important task in data mining, since outliers are very different from the rest of the objects in a cluster. The entropy based rough outlier factor (EROF) method is used to detect outliers in the yeast dataset. An example shows that EROF detects outliers effectively on protein localisation sites and that the ERKM clustering algorithm performs effectively. Further, experimental readings show that the ERKM and EROF methods outperform the other methods.

Defence Science Journal

Anomaly Detection on Intrusion Detection System (IDS) by Clustering Method (original title: Anomaly Detection pada Intrusion Detection System (IDS) Menggunakan Metode Clustering)

Author: Ananda Budi Mulia
Publication venue: Universitas Telkom
Publication date: 01/01/2006

ABSTRACT: An Intrusion Detection System (IDS) is a set of techniques and methods for detecting activities that occur at the network and host level. An IDS can follow two approaches: signature-based intrusion detection and anomaly detection. The first approach has a significant weakness: detection is performed only against data that has already been defined. Anomaly detection, besides using data that has already been defined, can also analyze anomalous patterns in incoming network packets, but if the parameters are chosen badly this method will often raise false alarms. Anomaly detection on incoming packets can be carried out with an outlier detection scheme. With this method, incoming packets are analyzed using one of several algorithms, among them clustering. The clustering algorithm in the outlier detection scheme clusters the data and marks the smallest cluster, which is then treated as an anomaly. This final project builds an implementation of intrusion detection for a computer system or network using the anomaly detection method with a cluster-based outlier detection algorithm. The clustering itself is performed on network connection records.
The implementation uses HTML, PHP scripts, and the MySQL DBMS. Evaluation of the anomaly detection system shows that the detection results depend strongly on three things: the choice of data to be analyzed (the data set), the maximum distance allowed from the cluster center to each data point that belongs to the cluster (the cluster radius), and the ratio of intrusion data to normal data in the data set.
Keywords: Intrusion Detection System (IDS), clustering, anomaly detection, outlier detection scheme

Open Library
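
The clustering step described in the abstract above (cluster the connection records, then treat the smallest cluster as the anomaly) can be sketched as follows. This is an illustrative reading only, not the thesis implementation, which used PHP and MySQL. The single-pass fixed-radius clustering, the rule that a cluster center is the first record assigned to it, the two-feature records, and the function names `fixed_radius_clusters` and `flag_smallest_cluster` are all assumptions.

```python
import math


def fixed_radius_clusters(records, radius):
    """Single-pass clustering: a record joins the nearest existing cluster
    if its center lies within `radius`; otherwise it starts a new cluster.

    Assumption for this sketch: a cluster's center is the first record
    assigned to it and is never updated.
    """
    centers, members = [], []
    for idx, rec in enumerate(records):
        best, best_dist = None, None
        for c, center in enumerate(centers):
            d = math.dist(rec, center)
            if best_dist is None or d < best_dist:
                best, best_dist = c, d
        if best is not None and best_dist <= radius:
            members[best].append(idx)
        else:
            centers.append(rec)
            members.append([idx])
    return centers, members


def flag_smallest_cluster(records, radius):
    """Return the indices of the records in the smallest cluster."""
    _, members = fixed_radius_clusters(records, radius)
    return sorted(min(members, key=len))


# Hypothetical connection records reduced to two numeric features.
records = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.0), (9.0, 9.0)]
anomalies = flag_smallest_cluster(records, radius=1.0)
```

Here the first four records fall into one cluster and the last record forms a cluster of its own, so it is the one flagged. The sketch also makes the abstract's three sensitivities visible: changing `radius`, the records chosen, or the share of intrusion records changes which cluster ends up smallest.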

Implementation Analysis Cluster-Based Outlier Detection in Categorical Data using CBLOF Algorithm (original title: Analisis Implementasi Deteksi Outlier Berbasis Klaster pada Data Kategorikal dengan Menggunakan Algoritma CBLOF)

Author: Arif Pradita Herman
Publication venue: Universitas Telkom
Publication date: 01/01/2011

ABSTRACT: Outlier detection is a data mining functionality that aims to find data that differ from the majority of the other data. Although outliers behave differently from the majority, they often contain very useful information. There are many methods for detecting outliers, but most are designed for numeric data and are not suitable for categorical data. Moreover, many algorithms need long processing times as the amount of data grows. CBLOF (Cluster Based Local Outlier Factor) is a cluster-based method for detecting outliers in categorical data.
A CBLOF value is calculated for each record, depending on whether the record falls in a large cluster or a small cluster, to decide whether the record is an outlier. Tests were carried out under several scenarios to determine the accuracy in terms of detection rate, false positive rate and false negative rate, the influence of the percentage of the rare class on accuracy, and the influence of the amount of data on processing time. CBLOF can detect outliers with relatively good accuracy in terms of detection rate, false positive rate and false negative rate. In addition, the process is fast because CBLOF needs to read the data set only once to decide which records are considered outliers.
Keywords: outlier, cluster, categorical, CBLOF

Open Library

Splitting hybrid Make-To-Order and Make-To-Stock demand profiles

Author: Aitken James
Garn Wolfgang
Publication date: 14/04/2015

In this paper a demand time series is analysed to support Make-To-Stock (MTS) and Make-To-Order (MTO) production decisions. Using a purely MTS production strategy based on the given demand can lead to unnecessarily high inventory levels, so it is necessary to identify likely MTO episodes. This research proposes a novel outlier detection algorithm based on special density measures. We divide the time series' histogram into three clusters. One, with frequent low volumes, covers MTS items, whilst a second accounts for high volumes and is dedicated to MTO items. The third cluster lies between the previous two, and its elements are assigned to either the MTO or the MTS class. The algorithm can be applied to a variety of time series, both stationary and non-stationary. We use empirical data from manufacturing to study the extent of inventory savings. The percentage of MTO items is reflected in the inventory savings, which were shown to average 18.1%.
Comment: demand analysis; time series; outlier detection; production strategy; Make-To-Order (MTO); Make-To-Stock (MTS); 15 pages, 9 figures

arXiv.org e-Print Archive

Surrey Research Insight

Outlier detection using modified-ranks and other variants

Author: Huang Huaming
Mehrotra Kishan
Mohan Chilukuri K.
Publication venue: SURFACE at Syracuse University
Publication date: 01/12/2011

Rank-based algorithms provide a promising approach for outlier detection, but currently used rank-based measures of outlier detection suffer from two deficiencies: first, they take a large value for an object whose density is high even though the object may not be an outlier; second, the distance between the object and its nearest cluster plays only a mild role, through its rank with respect to its neighbor. To correct these deficiencies, we introduce the concept of modified-rank and propose new algorithms for outlier detection based on this concept.

Syracuse University Research Facility and Collaborative Environment