8 research outputs found
Knowledge Reused Outlier Detection
Tremendous effort has been invested in unsupervised outlier detection research, which is conducted on unlabeled data sets under abnormality assumptions. With abundant related labeled data available as auxiliary information, we consider transferring knowledge from the labeled source data to facilitate unsupervised outlier detection on the target data set. To make full use of the source knowledge, the source data and target data are put together for joint clustering and outlier detection, using the source data cluster structure as a constraint. To achieve this, the categorical utility function is employed to regularize the partitions of the target data to be consistent with the source data labels. With an augmented matrix, the problem is completely solved by a K-means-based method with a rigorous mathematical formulation and a theoretical convergence guarantee. We have used four real-world data sets and eight outlier detection methods of different kinds for extensive experiments and comparison. The results demonstrate the effectiveness and significant improvements of the proposed methods in terms of outlier detection and cluster validity metrics. Moreover, a parameter analysis is provided as a practical guide, and a noisy source label analysis shows that the proposed method can handle real applications where source labels can be noisy.
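The idea of clustering and flagging outliers in the same loop can be illustrated with a generic K-means variant that sets aside the points farthest from their centroids before each centroid update. This is only a sketch under stated assumptions: it does not use source labels, the categorical utility function, or the augmented matrix described in the abstract, and the function name `kmeans_with_outliers` and its parameters are hypothetical.

```python
def kmeans_with_outliers(points, centroids, n_outliers, n_iter=20):
    """Generic illustration (not the paper's method): K-means that ignores
    the n_outliers points farthest from their nearest centroid."""
    centroids = [tuple(c) for c in centroids]
    labels, outliers = [], set()
    for _ in range(n_iter):
        # Squared distance from every point to its nearest centroid.
        nearest = []
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            best = min(range(len(centroids)), key=dists.__getitem__)
            nearest.append((dists[best], best))
        labels = [best for _, best in nearest]
        # The n_outliers points farthest from their centroid are set aside.
        order = sorted(range(len(points)), key=lambda i: nearest[i][0], reverse=True)
        outliers = set(order[:n_outliers])
        # Recompute each centroid from its inlier members only.
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [points[i] for i in range(len(points))
                       if labels[i] == k and i not in outliers]
            if members:
                new_centroids.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid unchanged
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, sorted(outliers), centroids
```

Flagged points still receive a nearest-cluster label here; a caller that wants a separate outlier class can overwrite the labels at the returned indices.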
Structural advances for pattern discovery in multi-relational databases
With ever-growing storage needs and a drift towards very large relational storage settings, multi-relational data mining has become a prominent and pertinent field for discovering unique and interesting relational patterns. As a consequence, a whole suite of multi-relational data mining techniques is being developed. These techniques may either be extensions of existing single-table mining techniques or may be developed from scratch. For the traditionalists, single-table mining algorithms can be applied to multi-relational settings by making inelegant and time-consuming joins of all target relations. However, complex relational patterns cannot be expressed in a single-table format and thus cannot be discovered. This work presents a new multi-relational frequent pattern mining algorithm termed Multi-Relational Frequent Pattern Growth (MRFP Growth). MRFP Growth is capable of mining multiple relations, linked with referential integrity, for frequent patterns that satisfy a user-specified support threshold. Empirical results on MRFP Growth performance and its comparison with state-of-the-art multi-relational data mining algorithms such as WARMR and Decentralized Apriori are discussed at length. MRFP Growth outperforms these two techniques in the number of patterns generated and in speed. The realm of multi-relational clustering is also explored in this thesis. A multi-Relational Item Clustering approach based on Hypergraphs (RICH) is proposed. Experimentally, RICH combined with MRFP Growth proves to be a competitive approach for clustering multi-relational data. The performance and quality of the clusters generated by RICH are compared with other clustering algorithms. Finally, the thesis demonstrates the applied utility of the theoretical implications of the above-mentioned algorithms in an application framework for auto-annotation of images in an image database. The system is called CoMMA, which stands for Combining Multi-relational Multimedia for Associations.
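To make the notion of a support threshold over linked relations concrete, the following brute-force sketch counts support per key of a target relation, gathering items from a second relation through a foreign key. It is not MRFP Growth, WARMR, or Decentralized Apriori; the relation names (`customers`, `orders`), the attribute names, and the function `frequent_patterns` are all hypothetical, and patterns are limited to one or two items for brevity.

```python
from itertools import combinations

def frequent_patterns(customers, orders, min_support):
    """Return item sets whose support (fraction of customers) meets min_support."""
    # Collect, per customer key, the items contributed by both relations.
    items_by_key = {
        cid: {f"customers.{attr}={val}" for attr, val in row.items()}
        for cid, row in customers.items()
    }
    for order in orders:
        cid = order["customer_id"]  # foreign key into the customers relation
        if cid in items_by_key:
            items_by_key[cid].add(f"orders.product={order['product']}")
    n = len(items_by_key)
    all_items = sorted(set().union(*items_by_key.values()))
    result = {}
    for size in (1, 2):  # enumerate candidate patterns of one or two items
        for pattern in combinations(all_items, size):
            count = sum(1 for items in items_by_key.values() if set(pattern) <= items)
            if count / n >= min_support:
                result[pattern] = count / n
    return result
```

Counting support per target key rather than per joined row avoids inflating the support of customers who have many orders, which is one reason a flat join can be misleading.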
Structure-Preserved Unsupervised Domain Adaptation
Domain adaptation has been a primary approach to addressing the issues caused by a lack of labels in many data mining tasks. Although considerable effort has been devoted to domain adaptation with promising results, most existing work learns a classifier on a source domain and then predicts the labels for target data, where only the instances near the boundary determine the hyperplane and the whole structure information is ignored. Moreover, little work has been done on multi-source domain adaptation. To that end, we develop a novel unsupervised domain adaptation framework, which ensures that the whole structure of the source domains is preserved to guide the target structure learning in a semi-supervised clustering fashion. To our knowledge, this is the first time that the domain adaptation problem has been reformulated as a semi-supervised clustering problem with target labels as missing values. Furthermore, by introducing an augmented matrix, a non-trivial solution is designed, which can be exactly mapped into a K-means-like optimization problem with a modified distance function and an efficient update rule for centroids. Extensive experiments on several widely used databases show substantial improvements of our proposed approach over state-of-the-art methods.
Weighting Policies for Robust Unsupervised Ensemble Learning
Unsupervised ensemble learning, or consensus clustering, consists of finding the optimal combination strategy of individual partitions that is robust to the selection of an algorithmic clustering pool. Despite its strong properties, this approach assigns the same weight to the contribution of each clustering to the final solution. We propose a weighting policy for this problem that is based on internal clustering quality measures and compare it against other modern approaches. Results on publicly available datasets show that weights can significantly improve accuracy while retaining the robust properties. Since determining an appropriate number of clusters, which is a primary input for many clustering methods, is one of the significant challenges, we have used the same methodology to predict the correct or most suitable number of clusters as well. Among various methods, using internal validity indexes in conjunction with a suitable algorithm is one of the most popular ways to determine the appropriate number of clusters. Thus, we use weighted consensus clustering along with four different indexes: the Silhouette (SH), Calinski-Harabasz (CH), Davies-Bouldin (DB), and Consensus (CI) indexes. Our experiment indicates that weighted consensus clustering together with the chosen indexes is a useful method for determining the right or most appropriate number of clusters in comparison to individual clustering methods (e.g., k-means) and consensus clustering. Lastly, to decrease the variance of the proposed weighted consensus clustering, we borrow the core idea of Markowitz portfolio theory and apply it to the clustering domain. We aim to optimize the combination of individual clustering methods to minimize the variance of clustering accuracy. This is a new weighting policy that produces a partition with a lower variance, which might be crucial for a decision maker. Our study shows that using the idea of Markowitz portfolio theory creates a partition with less variation in comparison to traditional consensus clustering and the proposed weighted consensus clustering.
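One common way to let weights influence a consensus is through a weighted co-association matrix, in which each partition contributes its weight to every pair of objects it places in the same cluster. The sketch below assumes this representation, which the abstract does not confirm; the weights are taken as given (for example, derived beforehand from an internal quality index), and the names `weighted_coassociation` and `consensus_labels` are hypothetical.

```python
def weighted_coassociation(partitions, weights):
    """Entry [i][j] is the normalized weight of partitions that co-cluster i and j."""
    n = len(partitions[0])
    total = sum(weights)
    matrix = [[0.0] * n for _ in range(n)]
    for labels, w in zip(partitions, weights):
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    matrix[i][j] += w / total
    return matrix

def consensus_labels(matrix, threshold=0.5):
    """Link objects whose co-association exceeds the threshold (connected components)."""
    n = len(matrix)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and matrix[i][j] > threshold:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels
```

With partitions `[[0, 0, 1, 1], [0, 0, 1, 1], [0, 1, 1, 1]]`, equal weights let the two agreeing partitions decide the consensus, while a weight vector such as `[0.2, 0.2, 0.6]` lets the third partition dominate.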
Voting-Based Consensus of Data Partitions
Over the past few years, there has been a renewed interest in the consensus
problem for ensembles of partitions. Recent work is primarily motivated by the
developments in the area of combining multiple supervised learners. Unlike the
consensus of supervised classifications, the consensus of data partitions is a
challenging problem due to the lack of globally defined cluster labels and to
the inherent difficulty of data clustering as an unsupervised learning problem.
Moreover, the true number of clusters may be unknown. A fundamental goal of
consensus methods for partitions is to obtain an optimal summary of an ensemble
and to discover a cluster structure with accuracy and robustness exceeding those
of the individual ensemble partitions.
The quality of the consensus partitions highly depends on the ensemble
generation mechanism and on the suitability of the consensus method for
combining the generated ensemble. Typically, consensus methods derive an
ensemble representation that is used as the basis for extracting the consensus
partition. Most ensemble representations circumvent the labeling problem. On
the other hand, voting-based methods establish direct parallels with consensus
methods for supervised classifications, by seeking an optimal relabeling of the
ensemble partitions and deriving an ensemble representation consisting of a
central aggregated partition. An important element of the voting-based
aggregation problem is the pairwise relabeling of an ensemble partition with
respect to a representative partition of the ensemble, which is referred to here
as the voting problem. The voting problem is commonly formulated as a weighted
bipartite matching problem.
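As an illustration of the voting problem posed as weighted bipartite matching, the sketch below relabels one partition against a reference by choosing the one-to-one label mapping that maximizes the number of objects on which the two agree. It assumes both partitions use the same number of clusters `k` and searches all permutations, which is only practical for small `k`; an assignment solver such as the Hungarian algorithm would replace the search at scale. The function name `relabel_by_matching` is hypothetical.

```python
from itertools import permutations

def relabel_by_matching(reference, partition, k):
    """Map the labels of `partition` onto those of `reference` by maximum overlap."""
    # contingency[a][b] = number of objects labeled a in partition and b in reference.
    contingency = [[0] * k for _ in range(k)]
    for r, p in zip(reference, partition):
        contingency[p][r] += 1
    # Each permutation is a candidate one-to-one mapping: partition label a -> perm[a].
    best = max(
        permutations(range(k)),
        key=lambda perm: sum(contingency[a][perm[a]] for a in range(k)),
    )
    return [best[p] for p in partition]
```

For example, with reference labels `[0, 0, 1, 1, 2, 2]` and partition labels `[2, 2, 0, 0, 1, 0]`, the best mapping sends 2 to 0, 0 to 1, and 1 to 2, so the two partitions agree on five of six objects after relabeling.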
In this dissertation, a general theoretical framework for the voting problem as
a multi-response regression problem is proposed. The problem is formulated as
seeking to estimate the uncertainties associated with the assignments of the
objects to the representative clusters, given their assignments to the clusters
of an ensemble partition. A new voting scheme, referred to as cumulative voting,
is derived as a special instance of the proposed regression formulation
corresponding to fitting a linear model by least squares estimation. The
proposed formulation reveals the close relationships between the underlying loss
functions of the cumulative voting and bipartite matching schemes. A useful
feature of the proposed framework is that it can be applied to model substantial
variability between partitions, such as a variable number of clusters.
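For hard partitions, fitting a linear model by least squares from the cluster indicators of an ensemble partition to the indicators of the reference partition has a simple closed form: the fitted weight for a pair of clusters is the fraction of the ensemble cluster's objects that fall in the reference cluster. The sketch below shows that special case only. It assumes hard labels on both sides, whereas the dissertation's formulation may be more general, and the function name `cumulative_vote` is hypothetical.

```python
def cumulative_vote(reference, partition, k_ref, k_part):
    """Soft relabeling of `partition` in terms of the reference clusters."""
    counts = [[0] * k_part for _ in range(k_ref)]
    sizes = [0] * k_part
    for r, p in zip(reference, partition):
        counts[r][p] += 1
        sizes[p] += 1
    # weights[q][l] is the least-squares coefficient for hard partitions:
    # the estimated probability of reference cluster q given partition cluster l.
    weights = [
        [counts[q][l] / sizes[l] if sizes[l] else 0.0 for l in range(k_part)]
        for q in range(k_ref)
    ]
    # Each object inherits the vote distribution of its partition cluster.
    return [[weights[q][p] for q in range(k_ref)] for p in partition]
```

Because the two partitions may have different numbers of clusters (`k_ref` and `k_part`), this form does not require a one-to-one label mapping, which is what allows it to model ensembles with a variable number of clusters.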
A general aggregation algorithm with variants corresponding to
cumulative voting and bipartite matching is applied and a simulation-based
analysis is presented to compare the suitability of each scheme to different
ensemble generation mechanisms. The bipartite matching is found to be more
suitable than cumulative voting for a particular generation model, whereby each
ensemble partition is generated as a noisy permutation of an underlying
labeling, according to a probability of error. For ensembles with a variable
number of clusters, it is proposed that the aggregated partition be viewed as an
estimated distributional representation of the ensemble, on the basis of which,
a criterion may be defined to seek an optimally compressed consensus partition.
The properties and features of the proposed cumulative voting scheme are
studied. In particular, the relationship between cumulative voting and the
well-known co-association matrix is highlighted. Furthermore, an adaptive
aggregation algorithm that is suited for the cumulative voting scheme is
proposed. The algorithm aims at selecting the initial reference partition and
the aggregation sequence of the ensemble partitions such that the loss of mutual
information associated with the aggregated partition is minimized. In order to
subsequently extract the final consensus partition, an efficient agglomerative
algorithm is developed. The algorithm merges the aggregated clusters such that
the maximum amount of information is preserved. Furthermore, it allows the
optimal number of consensus clusters to be estimated.
An empirical study using several artificial and real-world datasets demonstrates
that the proposed cumulative voting scheme leads to discovering substantially
more accurate consensus partitions compared to bipartite matching, in the case
of ensembles with a relatively large or a variable number of clusters. Compared
to other recent consensus methods, the proposed method is found to be comparable
with or better than the best performing methods. Moreover, accurate estimates of
the true number of clusters are often achieved using cumulative voting, whereas
consistently poor estimates are achieved based on bipartite matching. The
empirical evidence demonstrates that the bipartite matching scheme is not
suitable for these types of ensembles.
Ensemble and constrained clustering with applications
This thesis presents new developments in ensemble and constrained clustering and makes the following main contributions: 1) a unification of constrained and ensemble clustering in a single framework; 2) a new method for measuring and visualizing the variability of ensembles; 3) a new random-walker-based method for ensemble clustering; 4) an application of ensemble clustering to image segmentation; 5) a new consensus function for the ensemble clustering problem; and finally 6) an application of constrained clustering to the segmentation of nerve fibers in diffusion tensor imaging. These methods were tested in extensive experiments, which demonstrated their superiority over existing methods from the literature.