    Knowledge Reused Outlier Detection

    Tremendous effort has been invested in unsupervised outlier detection research, which is conducted on unlabeled data sets under abnormality assumptions. With abundant related labeled data available as auxiliary information, we consider transferring knowledge from the labeled source data to facilitate unsupervised outlier detection on a target data set. To make full use of the source knowledge, the source and target data are combined for joint clustering and outlier detection, with the source cluster structure serving as a constraint. To achieve this, a categorical utility function is employed to regularize the partitions of the target data to be consistent with the source labels. With an augmented matrix, the problem is solved by a K-means-based method with a rigorous mathematical formulation and a theoretical convergence guarantee. We have used four real-world data sets and eight outlier detection methods of different kinds for extensive experiments and comparison. The results demonstrate the effectiveness and significant improvements of the proposed method in terms of outlier detection and cluster validity metrics. Moreover, a parameter analysis is provided as a practical guide, and an analysis of noisy source labels shows that the proposed method can handle real applications where source labels may be noisy.
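    The joint-clustering idea can be illustrated with a small sketch: source and target points share one K-means run, source points pay an extra assignment cost when they leave the cluster tied to their label, and target points far from their centroid receive high outlier scores. This is an illustrative simplification under stated assumptions, not the paper's categorical-utility/augmented-matrix formulation; the function name, the label-to-cluster binding, and the quantile threshold are choices made here for the sketch.

```python
import numpy as np

def constrained_kmeans_outliers(source_X, source_y, target_X, k,
                                lam=100.0, n_iter=50, outlier_frac=0.1,
                                seed=0):
    """Joint K-means over source and target data: source points pay an
    extra cost `lam` when assigned to a cluster other than the one tied
    to their label, so the source label structure constrains the joint
    partition.  Target points far from their centroid get high outlier
    scores.  (Illustrative sketch only, not the paper's categorical
    utility / augmented matrix formulation.)"""
    rng = np.random.default_rng(seed)
    X = np.vstack([source_X, target_X])
    n_src = len(source_X)
    # seed centroids from source points so a target outlier cannot
    # become its own centroid
    centroids = source_X[rng.choice(n_src, k, replace=False)].copy()
    for _ in range(n_iter):
        d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        # simplified constraint: bind source label j to cluster j
        # (assumes labels are 0..k-1)
        d[:n_src] += lam * (np.arange(k)[None, :] != source_y[:, None])
        assign = d.argmin(1)
        for j in range(k):
            pts = X[assign == j]
            if len(pts):
                centroids[j] = pts.mean(0)
    # outlier score: distance of each target point to its own centroid
    scores = np.linalg.norm(target_X - centroids[assign[n_src:]], axis=1)
    return scores, scores > np.quantile(scores, 1 - outlier_frac)
```

    On two well-separated Gaussian clusters with one far-away target point, the far point receives the largest score and is flagged, while source labels keep the joint partition aligned with the source structure.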

    Structure-Preserved Unsupervised Domain Adaptation

    Domain adaptation has been a primary approach to addressing the lack of labels in many data mining tasks. Although considerable effort has been devoted to domain adaptation with promising results, most existing work learns a classifier on a source domain and then predicts the labels for target data, where only the instances near the boundary determine the hyperplane and the structure information of the whole domain is ignored. Moreover, little work has been done on multi-source domain adaptation. To that end, we develop a novel unsupervised domain adaptation framework which ensures that the whole structure of the source domains is preserved to guide the target structure learning in a semi-supervised clustering fashion. To our knowledge, this is the first time the domain adaptation problem has been reformulated as a semi-supervised clustering problem with target labels as missing values. Furthermore, by introducing an augmented matrix, a non-trivial solution is designed that can be exactly mapped into a K-means-like optimization problem with a modified distance function and an efficient centroid update rule. Extensive experiments on several widely used databases show the substantial improvements of our proposed approach over state-of-the-art methods.
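    The semi-supervised clustering view can be sketched in a few lines: target labels are treated as missing values, source points stay anchored to their class, and both populations update shared centroids, so the whole source structure (class means), not just boundary instances, guides the target partition. This is a minimal sketch under assumptions made here; the paper's augmented-matrix solution and modified distance function are not reproduced, and the function name is hypothetical.

```python
import numpy as np

def structure_preserved_adapt(src_X, src_y, tgt_X, n_iter=50):
    """Domain adaptation as semi-supervised clustering: one centroid per
    source class, target assignments are the missing labels, and source
    points remain fixed to their class during centroid updates.
    (Illustrative sketch only.)"""
    classes = np.unique(src_y)
    # initialise each centroid at its source-class mean, preserving the
    # source structure as a prior for the target clusters
    centroids = np.stack([src_X[src_y == c].mean(0) for c in classes])
    for _ in range(n_iter):
        d = ((tgt_X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        tgt_assign = d.argmin(1)  # fill in the "missing" target labels
        for j, c in enumerate(classes):
            pts = np.vstack([src_X[src_y == c], tgt_X[tgt_assign == j]])
            centroids[j] = pts.mean(0)  # source part keeps it anchored
    return classes[tgt_assign]
```

    With a mild shift between the source and target distributions, the anchored centroids move toward the target clusters while keeping the class correspondence, which is the structure-preservation intuition in miniature.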

    Weighting Policies for Robust Unsupervised Ensemble Learning

    Unsupervised ensemble learning, or consensus clustering, consists of finding the optimal combination strategy of individual partitions that is robust to the selection of the algorithmic clustering pool. Despite its strong properties, this approach assigns the same weight to the contribution of each clustering to the final solution. We propose a weighting policy for this problem based on internal clustering quality measures and compare it against other modern approaches. Results on publicly available datasets show that weights can significantly improve accuracy while retaining the robustness properties. Since determining an appropriate number of clusters, a primary input for many clustering methods, is itself a significant challenge, we also use the same methodology to predict the most suitable number of clusters. Among various methods, using internal validity indexes in conjunction with a suitable algorithm is one of the most popular ways to determine the appropriate number of clusters. Thus, we use weighted consensus clustering along with four different indexes: the Silhouette (SH), Calinski-Harabasz (CH), Davies-Bouldin (DB), and Consensus (CI) indexes. Our experiments indicate that weighted consensus clustering together with the chosen indexes is a useful method for determining the most appropriate number of clusters in comparison to individual clustering methods (e.g., k-means) and plain consensus clustering. Lastly, to decrease the variance of the proposed weighted consensus clustering, we borrow the core idea of Markowitz portfolio theory and apply it to the clustering domain. We aim to optimize the combination of individual clustering methods to minimize the variance of clustering accuracy. This is a new weighting policy that produces a partition with lower variance, which may be crucial for a decision maker. Our study shows that using the idea of Markowitz portfolio theory produces a partition with less variation than traditional consensus clustering and the proposed weighted consensus clustering.
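    One of the four indexes, the silhouette, is enough to sketch the weighting policy: each base partition is weighted by its (clipped) silhouette score, the weights build a co-association matrix, and strongly co-associated points are linked into the consensus partition. This is an illustrative sketch under assumptions made here (the 0.5 linking threshold, the clipping of negative scores, and the function names are choices for the sketch, not the paper's exact procedure).

```python
import numpy as np

def silhouette(X, labels):
    """Mean silhouette coefficient, plain NumPy (O(n^2))."""
    D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    s = []
    for i in range(len(X)):
        same = labels == labels[i]
        a = D[i, same].sum() / max(same.sum() - 1, 1)
        b = min(D[i, labels == c].mean()
                for c in np.unique(labels[~same]))
        s.append((b - a) / max(a, b))
    return float(np.mean(s))

def weighted_consensus(X, partitions):
    """Silhouette-weighted consensus clustering: weight each base
    partition by its quality, accumulate a weighted co-association
    matrix, then link points whose weighted co-association exceeds 0.5
    via union-find.  (Sketch; the paper also studies CH, DB and CI
    indexes as weights.)"""
    w = np.array([max(silhouette(X, p), 0.0) for p in partitions])
    w = w / w.sum() if w.sum() > 0 else np.full(len(w), 1 / len(w))
    n = len(X)
    co = np.zeros((n, n))
    for wi, p in zip(w, partitions):
        co += wi * (p[:, None] == p[None, :])
    parent = list(range(n))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i
    for i in range(n):
        for j in range(i + 1, n):
            if co[i, j] > 0.5:
                parent[find(i)] = find(j)
    _, labels = np.unique([find(i) for i in range(n)],
                          return_inverse=True)
    return labels
```

    A low-quality partition earns a near-zero weight and is effectively voted out of the co-association matrix. The Markowitz-style variant described above would instead derive weights from the covariance matrix of per-method performance, in the spirit of the classical minimum-variance portfolio, trading a little accuracy for a more stable consensus.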