1,083 research outputs found

    Entropy-based subspace clustering for mining numerical data.

    by Cheng, Chun-hung. Thesis (M.Phil.)--Chinese University of Hong Kong, 1999. Includes bibliographical references (leaves 72-76). Abstracts in English and Chinese.

    Contents:
    Abstract
    Acknowledgments
    Chapter 1. Introduction
      1.1 Six Tasks of Data Mining
        1.1.1 Classification
        1.1.2 Estimation
        1.1.3 Prediction
        1.1.4 Market Basket Analysis
        1.1.5 Clustering
        1.1.6 Description
      1.2 Problem Description
      1.3 Motivation
      1.4 Terminology
      1.5 Outline of the Thesis
    Chapter 2. Survey on Previous Work
      2.1 Data Mining
        2.1.1 Association Rules and its Variations
        2.1.2 Rules Containing Numerical Attributes
      2.2 Clustering
        2.2.1 The CLIQUE Algorithm
    Chapter 3. Entropy and Subspace Clustering
      3.1 Criteria of Subspace Clustering
        3.1.1 Criterion of High Density
        3.1.2 Correlation of Dimensions
      3.2 Entropy in a Numerical Database
        3.2.1 Calculation of Entropy
      3.3 Entropy and the Clustering Criteria
        3.3.1 Entropy and the Coverage Criterion
        3.3.2 Entropy and the Density Criterion
        3.3.3 Entropy and Dimensional Correlation
    Chapter 4. The ENCLUS Algorithms
      4.1 Framework of the Algorithms
      4.2 Closure Properties
      4.3 Complexity Analysis
      4.4 Mining Significant Subspaces
      4.5 Mining Interesting Subspaces
      4.6 Example
    Chapter 5. Experiments
      5.1 Synthetic Data
        5.1.1 Data Generation: Hyper-rectangular Data
        5.1.2 Data Generation: Linearly Dependent Data
        5.1.3 Effect of Changing the Thresholds
        5.1.4 Effectiveness of the Pruning Strategies
        5.1.5 Scalability Test
        5.1.6 Accuracy
      5.2 Real-life Data
        5.2.1 Census Data
        5.2.2 Stock Data
      5.3 Comparison with CLIQUE
        5.3.1 Subspaces with Uniform Projections
      5.4 Problems with Hyper-rectangular Data
    Chapter 6. Miscellaneous Enhancements
      6.1 Extra Pruning
      6.2 Multi-resolution Approach
      6.3 Multi-threshold Approach
    Chapter 7. Conclusion
    Bibliography
    Appendix A. Differential Entropy vs Discrete Entropy
      A.1 Relation of Differential Entropy to Discrete Entropy
    Appendix B. Mining Quantitative Association Rules
      B.1 Approaches
      B.2 Performance
      B.3 Final Remarks

    Unsupervised Discovery and Representation of Subspace Trends in Massive Biomedical Datasets

    The goal of this dissertation is to develop unsupervised algorithms for discovering previously unknown subspace trends in massive multivariate biomedical data sets without the benefit of prior information. A subspace trend is a sustained pattern of gradual or progressive changes within an unknown subset of feature dimensions. A fundamental challenge to subspace trend discovery is the presence of irrelevant data dimensions, noise, outliers, and confusion from multiple subspace trends that are driven by independent factors and mixed in with each other. These factors can obscure the trends in traditional data visualizations based on dimension reduction and projection. To overcome these limitations, we propose a novel graph-theoretic neighborhood similarity measure for sensing concordant progressive changes across data dimensions. Using this measure, we present an unsupervised algorithm for trend-relevant feature selection and visualization. Additionally, we propose to use an efficient online density-based representation to make the algorithm scalable to massive datasets. The representation assists not only in trend discovery but also in cluster detection, including the detection of rare populations. Our method has been successfully applied to diverse synthetic and real-world biomedical datasets, such as gene expression microarray data and the arbor morphology of neurons and microglia in brain tissue. The derived representations revealed biologically meaningful hidden subspace trends that were obscured by irrelevant features and noise. Although our applications are mostly from the biomedical domain, the proposed algorithm is broadly applicable to exploratory analysis of high-dimensional data, including visualization, hypothesis generation, knowledge discovery, and prediction, in other diverse applications. Electrical and Computer Engineering, Department o

    An approach to clustering biological phenotypes /

    Recently emerging approaches to high-throughput phenotyping have become important tools in unraveling the biological basis of agronomically and medically important phenotypes. These experiments produce very large sets of either low- or high-dimensional data. Finding clusters in the entire space of high-dimensional data (HDD) is a challenging task, because the relative distances between any two objects converge to zero with increasing dimensionality. Additionally, real data may not be mathematically well behaved. Finally, many clusters are expected on biological grounds to be "natural", that is, to have irregular, overlapping boundaries in different subsets of the dimensions. More precisely, the natural clusters of the data could differ in shape, size, density, and dimensionality, and they might not be disjoint. In principle, such data could be clustered by dimension reduction methods. However, these methods convert many dimensions to a smaller set of dimensions, which makes the clustering results difficult to interpret and may also lead to a significant loss of information. Another possible approach is to find subspaces (subsets of dimensions) in the entire data space of the HDD. However, the existing subspace methods do not discover natural clusters. Therefore, in this dissertation I propose a novel data preprocessing method, demonstrating that a group of phenotypes are interdependent, and propose a novel density-based subspace clustering algorithm for high-dimensional data, called Dynamic Locally Density Adaptive Scalable Subspace Clustering (DynaDASC). This algorithm is relatively locally density adaptive, scalable, dynamic, and nonmetric in nature, and discovers natural clusters. Dr. Toni Kazic, Dissertation Supervisor. Includes vita. Includes bibliographical references (pages 62-73).

    HARP: A practical projected clustering algorithm

    In high-dimensional data, clusters can exist in subspaces that hide themselves from traditional clustering methods. A number of algorithms have been proposed to identify such projected clusters, but most of them rely on user parameters to guide the clustering process. The clustering accuracy can be seriously degraded if incorrect values are used. Unfortunately, in real situations, it is rarely possible for users to supply the parameter values accurately, which causes practical difficulties in applying these algorithms to real data. In this paper, we analyze the major challenges of projected clustering and suggest why these algorithms need to depend heavily on user parameters. Based on the analysis, we propose a new algorithm that exploits the clustering status to adjust the internal thresholds dynamically without the assistance of user parameters. According to the results of extensive experiments on real and synthetic data, the new method has excellent accuracy and usability. It outperformed the other algorithms even when correct parameter values were artificially supplied to them. The encouraging results suggest that projected clustering can be a practical tool for various kinds of real applications.

    Correlation Clustering

    Knowledge Discovery in Databases (KDD) is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data. The core step of the KDD process is the application of a data mining algorithm to produce a particular enumeration of patterns and relationships in large databases. Clustering is one of the major data mining techniques and aims at grouping the data objects into meaningful classes (clusters) such that the similarity of objects within clusters is maximized and the similarity of objects from different clusters is minimized. This can serve to group customers with similar interests, or to group genes with related functionalities. High-dimensional feature spaces are currently a particular challenge for clustering techniques. Owing to modern facilities of data collection, real data sets usually contain many features. These features are often noisy or correlated with each other. However, since these effects differ in relevance across different parts of the data set, irrelevant features cannot be discarded in advance. The selection of relevant features must therefore be integrated into the data mining technique. For about 10 years, specialized clustering approaches have been developed to cope with problems in high-dimensional data better than classic clustering approaches. Often, however, problems of very different natures are not distinguished from one another. A main objective of this thesis is therefore a systematic classification of the diverse approaches developed in recent years according to their task definition, their basic strategy, and their algorithmic approach. We discern as main categories the search for clusters (i) w.r.t. closeness of objects in axis-parallel subspaces, (ii) w.r.t. common behavior (patterns) of objects in axis-parallel subspaces, and (iii) w.r.t. closeness of objects in arbitrarily oriented subspaces (so-called correlation clusters).
For the third category, the remaining parts of the thesis describe novel approaches. A first approach is the adaptation of density-based clustering to the problem of correlation clustering. The starting point here is the first density-based approach in this field, the algorithm 4C. Subsequently, enhancements and variations of this approach are discussed that allow for more robust, more efficient, or more effective behavior, or that even find hierarchies of correlation clusters and the corresponding subspaces. The density-based approach to correlation clustering, however, is fundamentally unable to solve some issues because it requires an analysis of local neighborhoods, which is problematic in high-dimensional data. Therefore, a novel method is proposed that tackles the correlation clustering problem with a global approach. A further method is proposed to derive models for correlation clusters, allowing the clusters to be interpreted and facilitating more thorough analysis in the corresponding domain science. Finally, possible applications of these models are proposed and discussed.

    Binding Affinity and Specificity of SH2 Domain Interactions in Receptor Tyrosine Kinase Signaling Networks

    Receptor tyrosine kinase (RTK) signaling mechanisms play a central role in intracellular signaling and control the development of multicellular organisms, cell growth, cell migration, and programmed cell death. Dysregulation of these signaling mechanisms results in developmental defects and diseases such as cancer. Control of this network relies on the specificity and selectivity of Src Homology 2 (SH2) domain interactions with phosphorylated target peptides. In this work, we review the current quantitative understanding of SH2 domain interactions and identify severe limitations in the accuracy and availability of SH2 domain interaction data. We propose a framework to address some of these limitations and present new results that improve the quality and accuracy of currently available data. Furthermore, we supplement published results with a large body of high-confidence negative interactions extracted from rejected data, allowing for improved modeling and prediction of SH2 interactions. We present and analyze new experimental results for the dynamic response of downstream signaling proteins to RTK signaling. Our data identify differences in downstream response depending on the character and dose of the receptor stimulus, which has implications for previous studies using high-dose stimulation. We review some of the methods used in this work, focusing on pitfalls of clustering biological data, and address the high-dimensional nature of biological data from high-throughput experiments, the failure to consider more than one clustering method for a given problem, and the difficulty in determining whether clustering has produced meaningful results.

    Unsupervised learning on social data


    Visual Analysis of High-Dimensional Point Clouds using Topological Abstraction

    This thesis is about visualizing a kind of data that is trivial to process by computers but difficult to imagine by humans because nature does not allow for intuition with this type of information: high-dimensional data. Such data often result from representing observations of objects under various aspects or with different properties. In many applications, a typical, laborious task is to find related objects or to group those that are similar to each other. One classic solution for this task is to imagine the data as vectors in a Euclidean space with object variables as dimensions. Utilizing Euclidean distance as a measure of similarity, objects with similar properties and values accumulate to groups, so-called clusters, that are exposed by cluster analysis on the high-dimensional point cloud. Because similar vectors can be thought of as objects that are alike in terms of their attributes, the point cloud's structure and individual cluster properties, like their size or compactness, summarize data categories and their relative importance. The contribution of this thesis is a novel analysis approach for visual exploration of high-dimensional point clouds without suffering from structural occlusion. The work is based on implementing two key concepts: The first idea is to discard those geometric properties that cannot be preserved and, thus, lead to the typical artifacts. Topological concepts are used instead to shift away the focus from a point-centered view on the data to a more structure-centered perspective. The advantage is that topology-driven clustering information can be extracted in the data's original domain and be preserved without loss in low dimensions. The second idea is to split the analysis into a topology-based global overview and a subsequent geometric local refinement. 
The occlusion-free overview enables the analyst to identify features and to link them to other visualizations that permit analysis of those properties not captured by the topological abstraction, e.g. cluster shape or value distributions in particular dimensions or subspaces. The advantage of separating structure from data point analysis is that restricting local analysis to data subsets significantly reduces artifacts and the visual complexity of standard techniques. That is, the additional topological layer enables the analyst to identify structure that was hidden before and to focus on particular features by suppressing irrelevant points during local feature analysis. This thesis addresses the topology-based visual analysis of high-dimensional point clouds for both the time-invariant and the time-varying case. Time-invariant means that the points do not change in number or position. That is, the analyst explores the clustering of a fixed and constant set of points. The extension to the time-varying case implies the analysis of a varying clustering, where clusters newly appear, merge, split, or vanish. Especially for high-dimensional data, both tracking (relating features over time) and visualizing changing structure are difficult problems to solve.