394 research outputs found

    Spatially-Aware Comparison and Consensus for Clusterings

    This paper proposes a new distance metric between clusterings that incorporates information about the spatial distribution of points and clusters. Our approach builds on the idea of a Hilbert space-based representation of clusters as a combination of the representations of their constituent points. We use this representation and the underlying metric to design a spatially-aware consensus clustering procedure. This consensus procedure is implemented via a novel reduction to Euclidean clustering, and is both simple and efficient. All of our results apply to both soft and hard clusterings. We accompany these algorithms with a detailed experimental evaluation that demonstrates the efficiency and quality of our techniques. Comment: 12 pages, 9 figures, Proceedings of the 2011 SIAM International Conference on Data Mining.
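As a rough illustration of the idea of representing a cluster through its constituent points, the sketch below computes a kernel distance between two clusters. It is not taken from the paper: the Gaussian kernel, the uniform point weights, the bandwidth `sigma`, and the function names are all assumptions made for this example, and the paper's full metric between clusterings is not reproduced here.

```python
import math

def gaussian_kernel(p, q, sigma=1.0):
    """Gaussian similarity between two points (assumed kernel choice)."""
    sq = sum((a - b) ** 2 for a, b in zip(p, q))
    return math.exp(-sq / (2.0 * sigma ** 2))

def mean_kernel(A, B, sigma=1.0):
    """Average kernel value over all pairs (one point from A, one from B)."""
    total = sum(gaussian_kernel(p, q, sigma) for p in A for q in B)
    return total / (len(A) * len(B))

def cluster_distance(A, B, sigma=1.0):
    """Distance between the kernel mean embeddings of clusters A and B.

    Uses ||mu_A - mu_B||^2 = k(A, A) + k(B, B) - 2 k(A, B), where k(., .)
    is the average pairwise kernel value.
    """
    sq = (mean_kernel(A, A, sigma) + mean_kernel(B, B, sigma)
          - 2.0 * mean_kernel(A, B, sigma))
    return math.sqrt(max(sq, 0.0))  # clamp tiny negative rounding errors

# Hypothetical toy clusters in the plane.
A = [(0.0, 0.0), (1.0, 0.0)]
B_near = [(0.1, 0.1), (1.1, 0.1)]
B_far = [(5.0, 5.0), (6.0, 5.0)]
print(cluster_distance(A, B_near), cluster_distance(A, B_far))
```

Because the distance depends on where the points lie rather than only on their labels, two clusters that sit close together score as more similar than two that are far apart, which is the spatial awareness the abstract describes.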

    Doctor of Philosophy

    Dissertation. With the tremendous growth of data produced in recent years, it is impossible to identify patterns or test hypotheses without reducing data size. Data mining is an area of science that extracts useful information from data by discovering the patterns and structures present in it. In this dissertation, we largely focus on clustering, which is often the first step in any exploratory data mining task: items that are similar to each other are grouped together, making downstream data analysis more robust. Different clustering techniques have different strengths, and the resulting groupings provide different perspectives on the data. Due to the unsupervised nature of the task, i.e., the lack of domain experts who can label the data, validation of results is very difficult. While there are measures that compute "goodness" scores for clustering solutions as a whole, there are few methods that validate the assignment of individual data items to their clusters. To address these challenges, we focus on developing a framework that can generate, compare, combine, and evaluate different solutions to make more robust and significant statements about the data. In the first part of this dissertation, we present fast and efficient techniques to generate and combine different clustering solutions. We build on some recent ideas on efficient representations of clusters of partitions to develop a well-founded, spatially aware metric to compare clusterings. With the ability to compare clusterings, we describe a heuristic to combine different solutions to produce a single high-quality clustering. We also introduce a Markov chain Monte Carlo approach to sample different clusterings from the entire landscape to provide users with a variety of choices. In the second part of this dissertation, we build certificates for individual data items and study their influence on effective data reduction. We present a geometric approach by defining regions of influence for data items and clusters, and use this to develop adaptive sampling techniques to speed up machine learning algorithms. This dissertation is therefore a systematic approach to studying the landscape of clusterings in an attempt to provide a better understanding of the data.

    Power to the points: validating data memberships in clusterings

    Pre-print. In this paper, we present a method to attach affinity scores to the implicit labels of individual points in a clustering. The affinity scores capture the confidence level of the cluster that claims to "own" the point. We demonstrate that these scores accurately capture the quality of the label assigned to the point. We also show further applications of these scores to estimate global measures of clustering quality, as well as accelerate clustering algorithms by orders of magnitude using active selection based on affinity. This method is very general and applies to clusterings derived from any geometric source. It lends itself to easy visualization and can prove useful as part of an interactive visual analytics framework. It is also efficient: assigning an affinity score to a point depends only polynomially on the number of clusters and is independent both of the size and dimensionality of the data. It is based on techniques from the theory of interpolation, coupled with sampling and estimation algorithms from high dimensional computational geometry.

    Constrained Distance Based Clustering for Satellite Image Time-Series

    The advent of high-resolution instruments for time-series sampling poses added complexity for the formal definition of thematic classes in the remote sensing domain (required by supervised methods), while unsupervised methods ignore expert knowledge and intuition. Constrained clustering is becoming an increasingly popular approach in data mining because it offers a solution to these problems; however, its application in remote sensing is relatively unknown. This article addresses this divide by adapting publicly available constrained clustering implementations to use the dynamic time warping (DTW) dissimilarity measure, which is sometimes used for time-series analysis. A comparative study is presented in which their performance is evaluated using both DTW and Euclidean distances. It is found that adding constraints to the clustering problem results in an increase in accuracy when compared to unconstrained clustering. The outputs of such algorithms are homogeneous in spatially defined regions. Declarative approaches and k-Means based algorithms are simple to apply, requiring little or no choice of parameter values. Spectral methods, however, require careful tuning, which is unrealistic in a semi-supervised setting, although they offer the highest accuracy. These conclusions were drawn from two applications: crop clustering using 11 multi-spectral Landsat images non-uniformly sampled over a period of eight months in 2007; and tree-cut detection using 10 NDVI Sentinel-2 images non-uniformly sampled between 2016 and 2018.
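For readers unfamiliar with the DTW dissimilarity mentioned in this abstract, a minimal sketch of the textbook dynamic-programming formulation follows. It is not the implementation used in the article: the absolute difference as local cost, the lack of a warping window, and the function name `dtw_distance` are assumptions made for illustration.

```python
import math

def dtw_distance(s, t):
    """Classic O(len(s) * len(t)) dynamic time warping between two 1-D series.

    cost[i][j] holds the cheapest alignment cost of s[:i] with t[:j].
    """
    n, m = len(s), len(t)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(s[i - 1] - t[j - 1])  # local cost (assumed: absolute difference)
            # Extend the cheapest of: step in s only, step in t only, step in both.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

# A series and a time-stretched copy align with zero cost.
print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))
```

Unlike the Euclidean distance, this measure accepts series of different lengths and tolerates local shifts in time, which is relevant for the non-uniformly sampled image series the abstract describes.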

    Image Based Biomarkers from Magnetic Resonance Modalities: Blending Multiple Modalities, Dimensions and Scales.

    The successful analysis and processing of medical imaging data is a multidisciplinary work that requires the application and combination of knowledge from diverse fields, such as medical engineering, medicine, computer science and pattern classification. Imaging biomarkers are biologic features detectable by imaging modalities, and their use offers the prospect of more efficient clinical studies and improvement in both diagnosis and therapy assessment. The use of Dynamic Contrast Enhanced Magnetic Resonance Imaging (DCE-MRI) and its application to diagnosis and therapy have been extensively validated; nevertheless, the issue of an appropriate or optimal processing of data that helps to extract relevant biomarkers to highlight the difference between heterogeneous tissue still remains. Together with DCE-MRI, the data extracted from Diffusion MRI (DWI-MR and DTI-MR) represents a promising and complementary tool. This project initially proposes the exploration of diverse techniques and methodologies for the characterization of tissue, following an analysis and classification of voxel-level time-intensity curves from DCE-MRI data, mainly through the exploration of dissimilarity-based representations and models. We will explore metrics and representations to correlate the multidimensional data acquired through diverse imaging modalities, a work which starts with the appropriate elastic registration methodology between DCE-MRI and DWI-MR on the breast and its corresponding validation. It has been shown that the combination of multi-modal MRI images improves the discrimination of diseased tissue. However, the fusion of dissimilar imaging data for classification and segmentation purposes is not a trivial task: there is an inherent difference in information domains, dimensionality and scales. This work also proposes a multi-view consensus clustering methodology for the integration of multi-modal MR images into a unified segmentation of tumoral lesions for heterogeneity assessment. Using a variety of metrics and distance functions, this multi-view imaging approach calculates multiple vectorial dissimilarity-spaces for each one of the MRI modalities and makes use of the concepts behind cluster ensembles to combine a set of base unsupervised segmentations into a unified partition of the voxel-based data. The methodology is specially designed for combining DCE-MRI and DTI-MR, for which a manifold learning step is implemented in order to account for the geometric constraints of the high dimensional diffusion information.

    Human activity is altering the world’s zoogeographical regions

    Zoogeographical regions, or zooregions, are areas of the Earth defined by species pools that reflect ecological, historical, and evolutionary processes acting over millions of years. Consequently, researchers have assumed that zooregions are robust and unlikely to change on a human timescale. However, the increasing number of human-mediated introductions and extinctions can challenge this assumption. By delineating zooregions with a network-based algorithm, here we show that introductions and extinctions are altering the zooregions we know today. Introductions are homogenising the Eurasian and African mammal zooregions and also triggering less intuitive effects in birds and amphibians, such as dividing and redefining zooregions representing the Old and New World. Furthermore, these Old and New World amphibian zooregions are no longer detected when considering introductions plus extinctions of the most threatened species. Our findings highlight the profound and far-reaching impact of human activity and call for identifying and protecting the uniqueness of biotic assemblages.

    Leveraging Structural Flexibility to Predict Protein Function

    Proteins are essentially versatile and flexible molecules, and understanding protein function plays a fundamental role in understanding biological systems. Protein structure comparisons are widely used for revealing protein function. However, because they assume rigidity or partial rigidity, most existing comparison methods do not consider conformational flexibility in protein structures. To address this issue, this thesis seeks to develop algorithms for flexible structure comparisons to predict one specific aspect of protein function, binding specificity. Given conformational samples as a representation of flexibility, we focus on two predictive problems related to specificity: aggregate prediction and individual prediction. For aggregate prediction, we have designed FAVA (Flexible Aggregate Volumetric Analysis). FAVA is the first conformationally general method to compare proteins with identical folds but different specificities. FAVA is able to correctly categorize members of protein superfamilies and to identify influential amino acids that cause different specificities. A second method, PEAP (Point-based Ensemble for Aggregate Prediction), employs ensemble clustering techniques over many base clusterings to predict binding specificity. This method incorporates structural motions of functional substructures and is capable of mitigating prediction errors. For individual prediction, the first method is an atomic point representation of flexibility in the binding cavity. This representation is able to predict binding specificity on each protein conformation with high accuracy, and it is the first to analyze maps of binding cavity conformations that describe proteins with different specificities. Our second method introduces a volumetric lattice representation. This representation localizes the solvent-accessible shape of the binding cavity by computing cavity volume in each user-defined space. It proves to be more informative than point-based representations. Finally, we discuss a structure-independent representation. This representation builds a lattice model on protein electrostatic isopotentials. This is the first known method to predict binding specificity explicitly from the perspective of electrostatic fields. The methods presented in this thesis incorporate the variety of protein conformations into the analysis of protein-ligand binding, and provide more views on flexible structure comparisons and structure-based function annotation of molecular design.

    Searching and mining in enriched geo-spatial data

    The emergence of new data collection mechanisms in geo-spatial applications paired with a heightened tendency of users to volunteer information provides an ever-increasing flow of data of high volume, complex nature, and often associated with inherent uncertainty. Such mechanisms include crowdsourcing, automated knowledge inference, tracking, and social media data repositories. Such data bearing additional information from multiple sources like probability distributions, text or numerical attributes, social context, or multimedia content can be called multi-enriched. Searching and mining this abundance of information holds many challenges, if all of the data's potential is to be released. This thesis addresses several major issues arising in that field, namely path queries using multi-enriched data, trend mining in social media data, and handling uncertainty in geo-spatial data. In all cases, the developed methods have made significant contributions and have appeared in or were accepted into various renowned international peer-reviewed venues. A common use of geo-spatial data is path queries in road networks where traditional methods optimise results based on absolute and ofttimes singular metrics, i.e., finding the shortest paths based on distance or the best trade-off between distance and travel time. Integrating additional aspects like qualitative or social data by enriching the data model with knowledge derived from sources as mentioned above allows for queries that can be issued to fit a broader scope of needs or preferences. This thesis presents two implementations of incorporating multi-enriched data into road networks. In one case, a range of qualitative data sources is evaluated to gain knowledge about user preferences which is subsequently matched with locations represented in a road network and integrated into its components. Several methods are presented for highly customisable path queries that incorporate a wide spectrum of data. 
In a second case, a framework is described for resource distribution with reappearance in road networks to serve one or more clients, resulting in paths that provide maximum gain based on a probabilistic evaluation of available resources. Applications for this include finding parking spots. Social media trends are an emerging research area giving insight into user sentiment and important topics. Such trends consist of bursts of messages concerning a certain topic within a time frame, significantly deviating from the average appearance frequency of the same topic. By investigating the dissemination of such trends in space and time, this thesis presents methods to classify trend archetypes in order to predict the future dissemination of a trend. Processing and querying uncertain data is particularly demanding given the additional knowledge required to yield results with probabilistic guarantees. Since such knowledge is not always available and queries are not easily scaled to larger datasets due to the #P-complete nature of the problem, many existing approaches reduce the data to a deterministic representation of its underlying model to eliminate uncertainty. However, data uncertainty can also provide valuable insight into the nature of the data that cannot be represented in a deterministic manner. This thesis presents techniques for clustering uncertain data, as well as for query processing, that take the additional information from uncertainty models into account while preserving scalability through a sampling-based approach, whereas previous approaches could only provide one of the two. 
The given solutions enable the application of various existing clustering techniques or query types to a framework that manages the uncertainty.

    Multi-modal segmentation of RGB-D images from appearance and geometric depth maps

    Classical image segmentation algorithms exploit the detection of similarities and discontinuities of different visual cues to define and differentiate multiple regions of interest in images. However, due to the high variability and uncertainty of image data, producing accurate results is difficult. In other words, segmentation based just on color is often insufficient for a large percentage of real-life scenes. This work presents a novel multi-modal segmentation strategy that integrates depth and appearance cues from RGB-D images by building a hierarchical region-based representation, i.e., a multi-modal segmentation tree (MM-tree). For this purpose, RGB-D image pairs are represented in a complementary fashion by different segmentation maps. Based on color images, a color segmentation tree (C-tree) is created to obtain segmented and over-segmented maps. From depth images, two independent segmentation maps are derived by computing planar and 3D edge primitives. Then, an iterative region merging process can be used to locally group the previously obtained maps into the MM-tree. Finally, the top emerging MM-tree level coherently integrates the available information from depth and appearance maps. The experiments were conducted using the NYU-Depth V2 RGB-D dataset, which demonstrated the competitive results of our strategy compared to state-of-the-art segmentation methods. Specifically, using test images, our method reached average scores of 0.56 in Segmentation Covering and 2.13 in Variation of Information.
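The abstract above reports a Variation of Information score. As a hedged illustration of what that measure computes, the sketch below evaluates it for two flat labelings of the same items, using the standard definition VI = H(A|B) + H(B|A). The natural logarithm and the function name are assumptions; the abstract does not state which logarithm base or evaluation code produced its score.

```python
import math
from collections import Counter

def variation_of_information(a, b):
    """Variation of Information between two labelings of the same items.

    Returns 0 for identical partitions (up to relabeling); lower is better
    when one labeling is a ground-truth segmentation.
    """
    if len(a) != len(b):
        raise ValueError("labelings must cover the same items")
    n = len(a)
    count_a, count_b = Counter(a), Counter(b)
    count_ab = Counter(zip(a, b))  # joint counts of (label in a, label in b)
    vi = 0.0
    for (x, y), n_xy in count_ab.items():
        p_xy = n_xy / n
        # -p(x,y) * [log p(x,y)/p(x) + log p(x,y)/p(y)]
        vi -= p_xy * (math.log(p_xy / (count_a[x] / n))
                      + math.log(p_xy / (count_b[y] / n)))
    return vi

# Hypothetical per-pixel region labels for a four-pixel image.
print(variation_of_information([0, 0, 1, 1], [0, 1, 0, 1]))
```

The measure ignores label names and compares only how the items are grouped, so two segmentations that differ only in region numbering score zero.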